Abstract

In  applications  throughout  science  and  engineering  one  is  often  faced  with  the  challenge  of 
solving  an  ill-posed  inverse  problem,  where  the  number  of  available  measurements  is  smaller 
than  the  dimension  of  the  model  to  be  estimated.  However  in  many  practical  situations  of 
interest,  models  are  constrained  structurally  so  that  they  only  have  a  few  degrees  of  freedom 
relative  to  their  ambient  dimension.  This  paper  provides  a  general  framework  to  convert  notions 
of  simplicity  into  convex  penalty  functions,  resulting  in  convex  optimization  solutions  to  linear, 
underdetermined inverse problems. The class of simple models considered comprises those formed as
the sum of a few atoms from some (possibly infinite) elementary atomic set; examples include
well-studied cases such as sparse vectors (e.g., signal processing, statistics) and low-rank
matrices (e.g., control, statistics), as well as several others including sums of a few permutation
matrices (e.g., ranked elections, multiobject tracking), low-rank tensors (e.g., computer vision,
neuroscience), orthogonal matrices (e.g., machine learning), and atomic measures (e.g., system
identification). The convex programming formulation is based on minimizing the norm induced
by  the  convex  hull  of  the  atomic  set;  this  norm  is  referred  to  as  the  atomic  norm.  The  facial 
structure  of  the  atomic  norm  ball  carries  a  number  of  favorable  properties  that  are  useful  for 
recovering  simple  models,  and  an  analysis  of  the  underlying  convex  geometry  provides  sharp 
estimates  of  the  number  of  generic  measurements  required  for  exact  and  robust  recovery  of 
models  from  partial  information.  These  estimates  are  based  on  computing  the  Gaussian  widths 
of  tangent  cones  to  the  atomic  norm  ball.  When  the  atomic  set  has  algebraic  structure  the 
resulting  optimization  problems  can  be  solved  or  approximated  via  semidefinite  programming. 

The  quality  of  these  approximations  affects  the  number  of  measurements  required  for  recovery, 
and this tradeoff is characterized via some examples. Thus this work extends the catalog of simple
models (beyond sparse vectors and low-rank matrices) that can be recovered from limited
linear  information  via  tractable  convex  programming. 
geometry; Gaussian width; symmetry.
1 Introduction


Deducing  the  state  or  structure  of  a  system  from  partial,  noisy  measurements  is  a  fundamental  task 
throughout  the  sciences  and  engineering.  A  commonly  encountered  difficulty  that  arises  in  such 
inverse  problems  is  the  very  limited  availability  of  data  relative  to  the  ambient  dimension  of  the 
signal  to  be  estimated.  However  many  interesting  signals  or  models  in  practice  contain  few  degrees 
of  freedom  relative  to  their  ambient  dimension.  For  instance  a  small  number  of  genes  may  constitute 
a  signature  for  disease,  very  few  parameters  may  be  required  to  specify  the  correlation  structure  in 
a  time  series,  or  a  sparse  collection  of  geometric  constraints  might  completely  specify  a  molecular 
configuration.  Such  low-dimensional  structure  plays  an  important  role  in  making  inverse  problems 
well-posed.  In  this  paper  we  propose  a  unified  approach  to  transform  notions  of  simplicity  into 
convex  penalty  functions,  thus  obtaining  convex  optimization  formulations  for  inverse  problems. 

We  describe  a  model  as  simple  if  it  can  be  written  as  a  linear  combination  of  a  few  elements 
from an atomic set. Concretely let $x \in \mathbb{R}^p$ be formed as follows:

$$x = \sum_{i=1}^{k} c_i a_i, \qquad a_i \in \mathcal{A},\ c_i \ge 0, \tag{1}$$

where  A  is  a  set  of  atoms  that  constitute  simple  building  blocks  of  general  signals.  Here  we 
assume  that  x  is  simple  so  that  k  is  relatively  small.  For  example  A  could  be  the  finite  set  of 
unit-norm  one-sparse  vectors  in  which  case  x  is  a  sparse  vector,  or  A  could  be  the  infinite  set 
of  unit-norm  rank-one  matrices  in  which  case  x  is  a  low-rank  matrix.  These  two  cases  arise  in 
many  applications,  and  have  received  a  tremendous  amount  of  attention  recently  as  several  authors 
have  shown  that  sparse  vectors  and  low-rank  matrices  can  be  recovered  from  highly  incomplete 
information  [13,  21,  22,  54,  14].  However  a  number  of  other  structured  mathematical  objects  also 
fit  the  notion  of  simplicity  described  in  (1).  The  set  A  could  be  the  collection  of  unit-norm  rank- 
one  tensors,  in  which  case  x  is  a  low-rank  tensor  and  we  are  faced  with  the  familiar  challenge  of 
low-rank  tensor  decomposition.  Such  problems  arise  in  numerous  applications  in  computer  vision 
and  image  processing  [1],  and  in  neuroscience  [4].  Alternatively  A  could  be  the  set  of  permutation 
matrices;  sums  of  a  few  permutation  matrices  are  objects  of  interest  in  ranking  [36]  and  multi-object 
tracking.  As  yet  another  example,  A  could  consist  of  measures  supported  at  a  single  point  so  that 
x  is  an  atomic  measure  supported  at  just  a  few  points.  This  notion  of  simplicity  arises  in  problems 
in  system  identification  and  statistics. 
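
As a concrete illustration of (1), the following sketch builds two simple models with NumPy: a sparse vector from signed standard basis vectors and a low-rank matrix from unit-norm rank-one matrices. The helper names `sparse_model` and `low_rank_model`, the weight range, and the dimensions are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def sparse_model(p, k, rng):
    """Return x = sum_i c_i a_i with atoms a_i = +/- e_j and weights c_i > 0."""
    support = rng.choice(p, size=k, replace=False)   # k distinct coordinates
    signs = rng.choice([-1.0, 1.0], size=k)          # selects the atom +e_j or -e_j
    weights = rng.uniform(0.5, 2.0, size=k)          # positive coefficients c_i
    x = np.zeros(p)
    x[support] = weights * signs
    return x

def low_rank_model(m, k, rng):
    """Return X = sum_i c_i u_i v_i^T with unit-norm u_i, v_i and c_i > 0."""
    X = np.zeros((m, m))
    for _ in range(k):
        u = rng.standard_normal(m)
        v = rng.standard_normal(m)
        u /= np.linalg.norm(u)
        v /= np.linalg.norm(v)
        # outer(u, v) is a rank-one atom with unit Euclidean (Frobenius) norm
        X += rng.uniform(0.5, 2.0) * np.outer(u, v)
    return X

rng = np.random.default_rng(0)
x = sparse_model(p=50, k=4, rng=rng)
X = low_rank_model(m=20, k=3, rng=rng)
print("nonzeros in x:", np.count_nonzero(x))
print("rank of X:", np.linalg.matrix_rank(X))
```

In both cases the number of atoms k is small relative to the ambient dimension (50 and 400 entries respectively), which is the sense in which the models are simple.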

In  each  of  these  examples  as  well  as  several  others,  a  fundamental  problem  of  interest  is  to  recover 
x  given  limited  linear  measurements.  For  instance  the  question  of  recovering  a  sparse  function  over 
the  group  of  permutations  (i.e.,  the  sum  of  a  few  permutation  matrices)  given  linear  measurements 
in  the  form  of  partial  Fourier  information  was  investigated  in  the  context  of  ranked  election  problems 
[36].  Similar  linear  inverse  problems  arise  with  atomic  measures  in  system  identification,  with 
orthogonal  matrices  in  machine  learning,  and  with  simple  models  formed  from  several  other  atomic 
sets  (see  Section  2.2  for  more  examples).  Hence  we  seek  tractable  computational  tools  to  solve 
such problems. When A is the collection of one-sparse vectors, a method of choice is to use the $\ell_1$
norm to induce sparse solutions. This method has seen a surge of interest in the last few years as it
provides  a  tractable  convex  optimization  formulation  to  exactly  recover  sparse  vectors  under  various 
conditions  [13,  21,  22].  More  recently  the  nuclear  norm  has  been  proposed  as  an  effective  convex 
surrogate  for  solving  rank  minimization  problems  subject  to  various  affine  constraints  [54,  14]. 

Motivated by the success of these methods we propose a general convex optimization framework
in Section 2 in order to recover objects with structure of the form (1) from limited linear
measurements. The guiding question behind our framework is: how do we take a concept of simplicity
such as sparsity and derive the $\ell_1$ norm as a convex heuristic? In other words what is the

Figure 1: Unit balls of some atomic norms: In each figure, the set of atoms is graphed in red and
the unit ball of the associated atomic norm is graphed in blue. In (a), the atoms are the
unit-Euclidean-norm one-sparse vectors, and the atomic norm is the $\ell_1$ norm. In (b), the atoms are the
$2 \times 2$ symmetric unit-Euclidean-norm rank-one matrices, and the atomic norm is the nuclear norm.
In (c), the atoms are the vectors $\{-1, +1\}^2$, and the atomic norm is the $\ell_\infty$ norm.


natural procedure to go from the set of one-sparse vectors $\mathcal{A}$ to the $\ell_1$ norm? We observe that
the convex hull of (unit-Euclidean-norm) one-sparse vectors is the unit ball of the $\ell_1$ norm, or the
cross-polytope. Similarly the convex hull of the (unit-Euclidean-norm) rank-one matrices is the
nuclear norm ball; see Figure 1 for illustrations. These constructions suggest a natural generalization
to other settings. Under suitable conditions the convex hull conv($\mathcal{A}$) defines the unit ball of
a norm, which is called the atomic norm induced by the atomic set $\mathcal{A}$. We can then minimize the
atomic  norm  subject  to  measurement  constraints,  which  results  in  a  convex  programming  heuristic 
for  recovering  simple  models  given  linear  measurements.  As  an  example  suppose  we  wish  to  recover 
the  sum  of  a  few  permutation  matrices  given  linear  measurements.  The  convex  hull  of  the  set  of 
permutation  matrices  is  the  Birkhoff  polytope  of  doubly  stochastic  matrices  [64],  and  our  proposal 
is  to  solve  a  convex  program  that  minimizes  the  norm  induced  by  this  polytope.  Similarly  if  we 
wish  to  recover  an  orthogonal  matrix  from  linear  measurements  we  would  solve  a  spectral  norm 
minimization  problem,  as  the  spectral  norm  ball  is  the  convex  hull  of  all  orthogonal  matrices.  As 
discussed  in  Section  2.5  the  atomic  norm  minimization  problem  is  the  best  convex  heuristic  for 
recovering  simple  models  with  respect  to  a  given  atomic  set. 

We  give  general  conditions  for  exact  and  robust  recovery  using  the  atomic  norm  heuristic.  In 
Section  3  we  provide  concrete  bounds  on  the  number  of  generic  linear  measurements  required  for 
the  atomic  norm  heuristic  to  succeed.  This  analysis  is  based  on  computing  certain  Gaussian  widths 
of tangent cones with respect to the unit balls of the atomic norm [31]. Arguments based on Gaussian
width have been fruitfully applied to obtain bounds on the number of Gaussian measurements
for the special case of recovering sparse vectors via $\ell_1$ norm minimization [56, 59], but computing
Gaussian widths of general cones is not easy. Therefore it is important to exploit the special structure
in atomic norms, while still obtaining sufficiently general results that are broadly applicable.
An important theme in this paper is the connection between Gaussian widths and various notions
of symmetry. Specifically by exploiting symmetry structure in certain atomic norms as well as convex
duality properties, we give bounds on the number of measurements required for recovery using
very general atomic norm heuristics. For example we provide precise estimates of the number of
generic measurements required for exact recovery of an orthogonal matrix via spectral norm minimization,
and the number of generic measurements required for exact recovery of a permutation
matrix  by  minimizing  the  norm  induced  by  the  Birkhoff  polytope.  While  these  results  correspond 

Underlying model                          Convex heuristic                     # Gaussian measurements
$s$-sparse vector in $\mathbb{R}^p$       $\ell_1$ norm                        $(2s + 1) \log(p - s)$
$m \times m$ rank-$r$ matrix              nuclear norm                         $3r(2m - r) + 2(m - r - r^2)$
sign-vector $\{-1, +1\}^p$                $\ell_\infty$ norm                   $p/2$
$m \times m$ permutation matrix           norm induced by Birkhoff polytope    $9m \log(m)$
$m \times m$ orthogonal matrix            spectral norm

Table 1: A summary of the recovery bounds obtained using Gaussian width arguments.


to  the  recovery  of  individual  atoms  from  random  measurements,  our  techniques  are  more  generally 
applicable  to  the  recovery  of  models  formed  as  sums  of  a  few  atoms  as  well.  We  also  give  tighter 
bounds  than  those  previously  obtained  on  the  number  of  measurements  required  to  robustly  recover 
sparse vectors and low-rank matrices via $\ell_1$ norm and nuclear norm minimization. In all of the
cases  we  investigate,  we  find  that  the  number  of  measurements  required  to  reconstruct  an  object 
is  proportional  to  its  intrinsic  dimension  rather  than  the  ambient  dimension,  thus  confirming  prior 
folklore.  See  Table  1  for  a  summary  of  these  results. 
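
To get a feel for the scale of these bounds, the short sketch below evaluates three of the Table 1 expressions and compares them with the ambient dimension. The formulas are transcribed from the table as printed in this document; the function names and the sample sizes are illustrative assumptions.

```python
import math

def sparse_bound(s, p):
    """(2s + 1) log(p - s): s-sparse vector in R^p, l1 norm."""
    return (2 * s + 1) * math.log(p - s)

def sign_vector_bound(p):
    """p / 2: sign vector in {-1, +1}^p, l-infinity norm."""
    return p / 2

def permutation_bound(m):
    """9 m log(m): m x m permutation matrix, norm induced by the Birkhoff polytope."""
    return 9 * m * math.log(m)

p, s, m = 10_000, 10, 100
print(f"sparse:      {sparse_bound(s, p):8.1f} measurements, ambient dimension {p}")
print(f"sign vector: {sign_vector_bound(p):8.1f} measurements, ambient dimension {p}")
print(f"permutation: {permutation_bound(m):8.1f} measurements, ambient dimension {m * m}")
```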

Although our conditions for recovery and bounds on the number of measurements hold generally,
we note that it may not be possible to obtain a computable representation for the convex
hull conv($\mathcal{A}$) of an arbitrary set of atoms $\mathcal{A}$. This leads us to another important theme of this
paper, which we discuss in Section 4, on the connection between algebraic structure in $\mathcal{A}$ and
the semidefinite representability of the convex hull conv($\mathcal{A}$). In particular when $\mathcal{A}$ is an algebraic
variety the convex hull conv($\mathcal{A}$) can be approximated as (the projection of) a set defined by linear
matrix inequalities. Thus the resulting atomic norm minimization heuristic can be solved via
semidefinite programming. A second issue that arises in practice is that even with algebraic structure
in $\mathcal{A}$ the semidefinite representation of conv($\mathcal{A}$) may not be computable in polynomial time,
which  makes  the  atomic  norm  minimization  problem  intractable  to  solve.  A  prominent  example 
here  is  the  tensor  nuclear  norm  ball,  which  is  obtained  by  taking  the  convex  hull  of  the  rank-one 
tensors.  In  order  to  address  this  problem  we  give  a  hierarchy  of  semidefinite  relaxations  using  theta 
bodies  [32],  which  approximate  the  original  (intractable)  atomic  norm  minimization  problem.  A 
third  point  we  highlight  is  that  while  these  semidefinite  relaxations  are  more  tractable  to  solve,  we 
require  more  measurements  for  exact  recovery  of  the  underlying  model  than  if  we  solve  the  original 
intractable  atomic  norm  minimization  problem.  Hence  we  have  a  tradeoff  between  the  complexity 
of  the  recovery  algorithm  and  the  number  of  measurements  required  for  recovery.  We  illustrate  this 
tradeoff  with  the  cut  polytope,  which  is  intractable  to  compute,  and  its  relaxations. 

Outline  Section  2  describes  the  construction  of  the  atomic  norm,  gives  several  examples  of 
applications  in  which  these  norms  may  be  useful  to  recover  simple  models,  and  provides  general 
conditions  for  recovery  by  minimizing  the  atomic  norm.  In  Section  3  we  investigate  the  number 
of  generic  measurements  for  exact  or  robust  recovery  using  atomic  norm  minimization,  and  give 
estimates  in  a  number  of  settings  by  analyzing  the  Gaussian  width  of  certain  tangent  cones.  We 
address  the  problem  of  semidefinite  representability  and  tractable  relaxations  of  the  atomic  norm 
in  Section  4.  Section  5  describes  some  algorithmic  issues  as  well  as  a  few  simulation  results,  and 
we  conclude  with  a  brief  discussion  and  some  open  questions  in  Section  6. 

2  Atomic  Norms  and  Convex  Geometry 

In  this  section  we  describe  the  construction  of  an  atomic  norm  from  a  collection  of  simple  atoms. 
In  addition  we  give  several  examples  of  atomic  norms,  and  discuss  their  properties  in  the  context 
of solving ill-posed linear inverse problems. We denote the Euclidean norm by $\|\cdot\|$.


2.1  Definition 

Let $\mathcal{A}$ be a collection of atoms that is a compact subset of $\mathbb{R}^p$. We will assume throughout this
paper that no element $a \in \mathcal{A}$ lies in the convex hull of the other elements conv($\mathcal{A} \setminus a$), i.e., the
elements of $\mathcal{A}$ are the extreme points of conv($\mathcal{A}$). Let $\|x\|_{\mathcal{A}}$ denote the gauge of $\mathcal{A}$ [55]:

$$\|x\|_{\mathcal{A}} = \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}. \tag{2}$$

Note that the gauge is always a convex, extended-real valued function for any set $\mathcal{A}$. By convention
this function evaluates to $+\infty$ if $x$ does not lie in the affine hull of conv($\mathcal{A}$). We will assume
without loss of generality that the centroid of conv($\mathcal{A}$) is at the origin, as this can be achieved by
appropriate recentering. With this assumption the gauge function can be rewritten as:

$$\|x\|_{\mathcal{A}} = \inf\Big\{ \sum_{a \in \mathcal{A}} c_a \;:\; x = \sum_{a \in \mathcal{A}} c_a a,\ c_a \ge 0\ \ \forall a \in \mathcal{A} \Big\},$$

with the sum being replaced by an integral when $\mathcal{A}$ is uncountable. If $\mathcal{A}$ is centrally symmetric
about the origin (i.e., $a \in \mathcal{A}$ if and only if $-a \in \mathcal{A}$) we have that $\|\cdot\|_{\mathcal{A}}$ is a norm, which we call
the atomic norm induced by $\mathcal{A}$. The support function of $\mathcal{A}$ is given as:

$$\|x\|_{\mathcal{A}}^* = \sup\{\langle x, a \rangle : a \in \mathcal{A}\}. \tag{3}$$

If $\|\cdot\|_{\mathcal{A}}$ is a norm the support function $\|\cdot\|_{\mathcal{A}}^*$ is the dual norm of this atomic norm. From this
definition we see that the unit ball of $\|\cdot\|_{\mathcal{A}}$ is equal to conv($\mathcal{A}$). In many examples of interest
the set $\mathcal{A}$ is not centrally symmetric, so that the gauge function does not define a norm. However
our analysis is based on the underlying convex geometry of conv($\mathcal{A}$), and our results are applicable
even if $\|\cdot\|_{\mathcal{A}}$ does not define a norm. Therefore, with an abuse of terminology we generally refer
to $\|\cdot\|_{\mathcal{A}}$ as the atomic norm of the set $\mathcal{A}$ even if $\|\cdot\|_{\mathcal{A}}$ is not a norm. We note that the duality
characterization between (2) and (3) when $\|\cdot\|_{\mathcal{A}}$ is a norm is in fact applicable even in
infinite-dimensional Banach spaces by Bonsall’s atomic decomposition theorem [8], but our focus is on the
finite-dimensional case in this work. We investigate in greater detail the issues of representability
and efficient approximation of these atomic norms in Section 4.
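
For a finite atomic set, the rewritten gauge above is a linear program in the coefficients, and the support function (3) is a maximum over finitely many inner products. The following is a minimal sketch under that finiteness assumption, using `scipy.optimize.linprog`; the function names `atomic_norm` and `dual_atomic_norm` and the test vector are illustrative choices, not notation from the paper.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def atomic_norm(x, atoms):
    """Gauge of conv(atoms): minimize sum(c) subject to atoms @ c = x, c >= 0.

    `atoms` is a p x N array whose columns are the atoms.
    """
    n_atoms = atoms.shape[1]
    res = linprog(c=np.ones(n_atoms), A_eq=atoms, b_eq=x,
                  bounds=[(0, None)] * n_atoms, method="highs")
    # By convention the gauge is +infinity when x has no such representation.
    return res.fun if res.success else np.inf

def dual_atomic_norm(x, atoms):
    """Support function (3): the largest inner product <x, a> over the atoms."""
    return float(np.max(atoms.T @ x))

p = 4
one_sparse = np.hstack([np.eye(p), -np.eye(p)])                      # atoms {+/- e_i}
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=p))).T   # atoms {-1, +1}^p
x = np.array([3.0, -1.0, 0.0, 2.0])

print(atomic_norm(x, one_sparse), np.abs(x).sum())       # gauge vs l1 norm
print(dual_atomic_norm(x, one_sparse), np.abs(x).max())  # support function vs l-infinity norm
print(atomic_norm(x, signs), np.abs(x).max())            # gauge vs l-infinity norm
```

Both atomic sets here are centrally symmetric, so the gauge is a norm and the support function is its dual norm, which is why the two sets exchange the roles of the $\ell_1$ and $\ell_\infty$ norms.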

Equipped  with  a  convex  penalty  function  given  a  set  of  atoms,  we  propose  a  convex  optimization 
method to recover a “simple” model given limited linear measurements. Specifically suppose that
$x^\star$ is formed according to (1) from a set of atoms $\mathcal{A}$. Further suppose that we have a known linear
map $\Phi : \mathbb{R}^p \to \mathbb{R}^n$, and we have linear information about $x^\star$ as follows:

$$y = \Phi x^\star. \tag{4}$$


The  goal  is  to  reconstruct  x*  given  y.  We  consider  the  following  convex  formulation  to  accomplish 
this  task: 


$$\hat{x} = \arg\min_{x} \|x\|_{\mathcal{A}} \quad \text{s.t.} \quad y = \Phi x. \tag{5}$$


When $\mathcal{A}$ is the set of one-sparse atoms this problem reduces to standard $\ell_1$ norm minimization.
Similarly when $\mathcal{A}$ is the set of rank-one matrices this problem reduces to nuclear norm minimization.
More generally if the atomic norm $\|\cdot\|_{\mathcal{A}}$ is tractable to evaluate, then (5) potentially offers an efficient
convex programming formulation for reconstructing $x^\star$ from the limited information $y$. The dual
problem of (5) is given as follows:

$$\max_{z}\ y^T z \quad \text{s.t.} \quad \|\Phi^\dagger z\|_{\mathcal{A}}^* \le 1. \tag{6}$$

Here $\Phi^\dagger$ denotes the adjoint (or transpose) of the linear measurement map $\Phi$.
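
When the atomic set is finite, problem (5) can be written as a linear program over the atom coefficients. The sketch below does this with `scipy.optimize.linprog` and tries it on the one-sparse atoms with a random Gaussian measurement map. The function name `atomic_norm_recovery`, the problem sizes, and the random seed are assumptions made for illustration; this is one possible way to solve (5) in the finite case, not the algorithm discussed later in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def atomic_norm_recovery(Phi, y, atoms):
    """Solve (5) for a finite atomic set given as the columns of `atoms`.

    Writes x = atoms @ c with c >= 0 and minimizes sum(c) subject to
    Phi @ atoms @ c = y, which is a linear program in c.
    """
    n_atoms = atoms.shape[1]
    res = linprog(c=np.ones(n_atoms), A_eq=Phi @ atoms, b_eq=y,
                  bounds=[(0, None)] * n_atoms, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return atoms @ res.x

rng = np.random.default_rng(0)
p, n, s = 40, 25, 3                          # ambient dimension, measurements, sparsity
x_star = np.zeros(p)
x_star[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
Phi = rng.standard_normal((n, p))            # generic Gaussian measurement map
y = Phi @ x_star                             # noise-free measurements as in (4)

atoms = np.hstack([np.eye(p), -np.eye(p)])   # {+/- e_i}: (5) becomes l1 minimization
x_hat = atomic_norm_recovery(Phi, y, atoms)
print("recovery error:", np.linalg.norm(x_hat - x_star))
```

Swapping in a different `atoms` array (for example the sign vectors for small p) changes the penalty without changing the solver, which mirrors how (5) treats all atomic sets uniformly.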

The  convex  formulation  (5)  can  be  suitably  modified  in  case  we  only  have  access  to  inaccurate, 
noisy information. Specifically suppose that we have noisy measurements $y = \Phi x^\star + \omega$ where $\omega$
represents the noise term. A natural convex formulation is one in which the constraint $y = \Phi x$ of
(5) is replaced by the relaxed constraint $\|y - \Phi x\| \le \delta$, where $\delta$ is an upper bound on the size of
the noise $\omega$:

$$\hat{x} = \arg\min_{x} \|x\|_{\mathcal{A}} \quad \text{s.t.} \quad \|y - \Phi x\| \le \delta. \tag{7}$$


We say that we have exact recovery in the noise-free case if $\hat{x} = x^\star$ in (5), and robust recovery in
the noisy case if the error $\|\hat{x} - x^\star\|$ is small in (7). In Section 2.4 and Section 3 we give conditions
under which the atomic norm heuristics (5) and (7) recover $x^\star$ exactly or approximately. Atomic
norms have found fruitful applications in problems in approximation theory of various function
classes [51, 37, 3, 19]. However this prior body of work was concerned with infinite-dimensional
Banach spaces, and none of these references considers or provides recovery guarantees that are
applicable in our setting.


2.2  Examples 

Next  we  provide  several  examples  of  atomic  norms  that  can  be  viewed  as  special  cases  of  the 
construction  above.  These  norms  are  obtained  by  convexifying  atomic  sets  that  are  of  interest  in 
various  applications. 

Sparse  vectors.  The  problem  of  recovering  sparse  vectors  from  limited  measurements  has 
received  a  great  deal  of  attention,  with  applications  in  many  problem  domains.  In  this  case  the 
atomic set $\mathcal{A} \subset \mathbb{R}^p$ can be viewed as the set of unit-norm one-sparse vectors $\{\pm e_i\}_{i=1}^{p}$, and $k$-sparse
vectors in $\mathbb{R}^p$ can be constructed using a linear combination of $k$ elements of the atomic set. In this
case it is easily seen that the convex hull conv($\mathcal{A}$) is given by the cross-polytope (i.e., the unit ball
of the $\ell_1$ norm), and the atomic norm $\|\cdot\|_{\mathcal{A}}$ corresponds to the $\ell_1$ norm in $\mathbb{R}^p$.

Low-rank  matrices.  Recovering  low-rank  matrices  from  limited  information  is  also  a  problem 
that  has  received  considerable  attention  as  it  finds  applications  in  problems  in  statistics,  control, 
and  machine  learning.  The  atomic  set  A  here  can  be  viewed  as  the  set  of  rank-one  matrices  of 
unit-Euclidean-norm.  The  convex  hull  conv(A)  is  the  nuclear  norm  ball  of  matrices  in  which  the 
sum  of  the  singular  values  is  less  than  or  equal  to  one. 
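
The singular value decomposition gives one explicit atomic decomposition of a matrix into unit-norm rank-one atoms with nonnegative weights, and the sum of those weights is the nuclear norm. The sketch below checks this numerically with NumPy; the matrix sizes and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))   # a rank-2 matrix

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
# Each outer(U[:, i], Vt[i]) is a rank-one atom with unit Euclidean (Frobenius) norm.
atoms = [np.outer(U[:, i], Vt[i]) for i in range(len(sigma))]
X_rebuilt = sum(s * a for s, a in zip(sigma, atoms))

nuclear_norm = sigma.sum()   # sum of the nonnegative weights sigma_i
print("reconstruction error:", np.linalg.norm(X - X_rebuilt))
print("nuclear norm:", nuclear_norm, "vs numpy:", np.linalg.norm(X, "nuc"))
```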

Permutation  matrices.  A  problem  of  interest  in  a  ranking  context  [36]  or  an  object  tracking 
context  is  that  of  recovering  permutation  matrices  from  partial  information.  Suppose  that  a  small 
number  k  of  rankings  of  m  candidates  is  preferred  by  a  population.  Such  preferences  can  be 
modeled as the sum of a few $m \times m$ permutation matrices, with each permutation corresponding
to a particular ranking. By conducting surveys of the population one can obtain partial linear
information of these preferred rankings. The set $\mathcal{A}$ here is the collection of permutation matrices
(consisting of $m!$ elements), and the convex hull conv($\mathcal{A}$) is the Birkhoff polytope or the set of
doubly stochastic matrices [64]. The centroid of the Birkhoff polytope is the matrix $\mathbf{1}\mathbf{1}^T/m$, so
it  needs  to  be  recentered  appropriately.  We  mention  here  recent  work  by  Jagabathula  and  Shah 
[36]  on  recovering  a  sparse  function  over  the  symmetric  group  (i.e.,  the  sum  of  a  few  permutation 
matrices) given partial Fourier information; although the algorithm proposed in [36] is tractable it
is  not  based  on  convex  optimization. 
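
For small m the permutation-matrix atoms can be enumerated directly, which makes it easy to check two facts used above: convex combinations of permutation matrices are doubly stochastic, and the centroid of the atoms is $\mathbf{1}\mathbf{1}^T/m$. The following sketch does this for m = 3; the helper name `permutation_matrices` and the Dirichlet-sampled weights are illustrative assumptions.

```python
import itertools
import numpy as np

def permutation_matrices(m):
    """All m! permutation matrices of size m x m (rows of the identity, reordered)."""
    eye = np.eye(m)
    return [eye[list(perm)] for perm in itertools.permutations(range(m))]

m = 3
atoms = permutation_matrices(m)

# A random convex combination of the atoms: nonnegative weights summing to one.
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(len(atoms)))
D = sum(w * P for w, P in zip(weights, atoms))

centroid = sum(atoms) / len(atoms)   # average of all m! atoms
print("row sums:", D.sum(axis=1), "column sums:", D.sum(axis=0))
print("centroid:\n", centroid)
```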

Binary  vectors.  In  integer  programming  one  is  often  interested  in  recovering  vectors  in  which 
the  entries  take  on  values  of  ±1.  Suppose  that  there  exists  such  a  sign-vector,  and  we  wish  to 
recover  this  vector  given  linear  measurements.  This  corresponds  to  a  version  of  the  multi-knapsack 
problem [45]. In this case $\mathcal{A}$ is the set of all sign-vectors, and the convex hull conv($\mathcal{A}$) is the
hypercube or the unit ball of the $\ell_\infty$ norm. The image of this hypercube under a linear map is also
referred  to  as  a  zonotope  [64]. 

Vectors from lists. Suppose there is an unknown vector $x \in \mathbb{R}^p$, and that we are given the
entries of this vector without any information about the locations of these entries. For example if
$x = [3\ 1\ 2\ 2\ 4]^T$, then we are only given the list of numbers $\{1, 2, 2, 3, 4\}$ without their positions in
$x$. Further suppose that we have access to a few linear measurements of $x$. Can we recover $x$ by
solving a convex program? Such a problem is of interest in recovering partial rankings of elements
of a set. An extreme case is one in which we only have two preferences for rankings, i.e., a vector
in $\{1, 2\}^p$ composed only of ones and twos, which reduces to a special case of the problem above
of recovering binary vectors (in which the number of entries of each sign is fixed). For this problem
the set $\mathcal{A}$ is the set of all permutations of $x$ (which we know since we have the list of numbers
that compose $x$), and the convex hull conv($\mathcal{A}$) is the permutahedron [64, 57]. As with the Birkhoff
polytope, the permutahedron also needs to be recentered about the point $\mathbf{1}^T x / p$.

Matrices  constrained  by  eigenvalues.  This  problem  is  in  a  sense  the  non-commutative 
analog of the one above. Suppose that we are given the eigenvalues $\lambda$ of a symmetric matrix, but
no information about the eigenvectors. Can we recover such a matrix given some additional linear
measurements? In this case the set $\mathcal{A}$ is the set of all symmetric matrices with eigenvalues $\lambda$, and
the convex hull conv($\mathcal{A}$) is given by the Schur-Horn orbitope [57].

Orthogonal matrices. In many applications matrix variables are constrained to be orthogonal,
which is a non-convex constraint and may lead to computational difficulties. We consider one
such simple setting in which we wish to recover an orthogonal matrix given limited information in
the form of linear measurements. In this example the set $\mathcal{A}$ is the set of $m \times m$ orthogonal matrices,
and conv($\mathcal{A}$) is the spectral norm ball.
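
A quick numerical check of this example: every orthogonal matrix has spectral norm exactly one, and a convex combination of orthogonal matrices stays inside the spectral norm ball. The sketch below samples orthogonal matrices via a QR factorization of a Gaussian matrix, which is one convenient way to generate them and an assumption of this example, as are the helper names.

```python
import numpy as np

def random_orthogonal(m, rng):
    """Orthogonal factor Q from the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return Q

def spectral_norm(M):
    """Largest singular value of M."""
    return np.linalg.norm(M, 2)

rng = np.random.default_rng(0)
m = 5
Q1, Q2 = random_orthogonal(m, rng), random_orthogonal(m, rng)
mix = 0.3 * Q1 + 0.7 * Q2          # a point in conv(A)

print("spectral norm of an atom:", spectral_norm(Q1))
print("spectral norm of a convex combination:", spectral_norm(mix))
```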

Measures.  Recovering  a  measure  given  its  moments  is  another  question  of  interest  that  arises  in 
system  identification  and  statistics.  Suppose  one  is  given  access  to  a  linear  combination  of  moments 
of  an  atomically  supported  measure.  How  can  we  reconstruct  the  support  of  the  measure?  The 
set $\mathcal{A}$ here is the moment curve, and its convex hull conv($\mathcal{A}$) goes by several names including the
Caratheodory orbitope [57]. Discretized versions of this problem correspond to the set $\mathcal{A}$ being a
finite number of points on the moment curve; the convex hull conv($\mathcal{A}$) is then a cyclic polytope [64].

Cut  matrices.  In  some  problems  one  may  wish  to  recover  low-rank  matrices  in  which  the 
entries  are  constrained  to  take  on  values  of  ±1.  Such  matrices  can  be  used  to  model  basic  user 
preferences,  and  are  of  interest  in  problems  such  as  collaborative  filtering  [58].  The  set  of  atoms 
$\mathcal{A}$ could be the set of rank-one signed matrices, i.e., matrices of the form $zz^T$ with the entries of $z$
being $\pm 1$. The convex hull conv($\mathcal{A}$) of such matrices is the cut polytope [20]. An interesting issue
that  arises  here  is  that  the  cut  polytope  is  in  general  intractable  to  characterize.  However  there 
exist  several  well-known  tractable  semidefinite  relaxations  to  this  polytope  [20,  30],  and  one  can 
employ  these  in  constructing  efficient  convex  programs  for  recovering  cut  matrices.  We  discuss  this 
point  in  greater  detail  in  Section  4.3. 
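
To make the atoms concrete, the sketch below enumerates the matrices $zz^T$ for a small dimension and checks that each one, as well as a convex combination of them, satisfies the constraints of one standard semidefinite relaxation: symmetric positive semidefinite matrices with unit diagonal (often called the elliptope). Choosing this particular relaxation is an assumption of the example; the text above does not say which relaxation Section 4.3 uses. The helper names are likewise illustrative.

```python
import itertools
import numpy as np

def cut_atoms(m):
    """All distinct matrices z z^T with z in {-1, +1}^m (z and -z give the same atom)."""
    atoms = []
    for tail in itertools.product([-1.0, 1.0], repeat=m - 1):
        z = np.array((1.0,) + tail)      # fix z_1 = +1 to remove the sign ambiguity
        atoms.append(np.outer(z, z))
    return atoms

def in_elliptope(X, tol=1e-9):
    """Membership in {X symmetric, X PSD, diag(X) = 1}."""
    return bool(np.allclose(X, X.T)
                and np.allclose(np.diag(X), 1.0)
                and np.linalg.eigvalsh(X).min() >= -tol)

m = 4
atoms = cut_atoms(m)
point = sum(atoms) / len(atoms)          # a point of the cut polytope
print("number of atoms:", len(atoms))
print("all atoms satisfy the relaxation:", all(in_elliptope(a) for a in atoms))
print("their average satisfies the relaxation:", in_elliptope(point))
```

The relaxed set contains the cut polytope but is generally larger, which is the source of the tradeoff between tractability and the number of measurements mentioned in the introduction.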

Low-rank  tensors.  Low-rank  tensor  decompositions  play  an  important  role  in  numerous 
applications  throughout  signal  processing  and  machine  learning  [40].  Developing  computational 
tools  to  recover  low-rank  tensors  is  therefore  of  great  interest.  In  principle  we  could  solve  a  tensor 
nuclear  norm  minimization  problem,  in  which  the  tensor  nuclear  norm  ball  is  obtained  by  taking 
the convex hull of rank-one tensors. A computational challenge here is that the tensor nuclear norm
is  in  general  intractable  to  compute;  in  order  to  address  this  problem  we  discuss  further  convex 
relaxations  to  the  tensor  nuclear  norm  using  theta  bodies  in  Section  4.  A  number  of  additional 
technical  issues  also  arise  with  low-rank  tensors  including  the  non-existence  in  general  of  a  singular 
value  decomposition  analogous  to  that  for  matrices  [39],  and  the  difference  between  the  rank  of  a 
tensor  and  its  border  rank  [18]. 

Nonorthogonal  factor  analysis.  Suppose  that  a  data  matrix  admits  a  factorization  X  =  AB. 
The  matrix  nuclear  norm  heuristic  will  find  a  factorization  into  orthogonal  factors  in  which  the 
columns  of  A  and  rows  of  B  are  mutually  orthogonal.  However  if  a  priori  information  is  available 
about  the  factors,  precision  and  recall  could  be  improved  by  enforcing  such  priors.  These  priors 
may  sacrifice  orthogonality,  but  the  factors  might  better  conform  with  assumptions  about  how  the 
data  are  generated.  For  instance  in  some  applications  one  might  know  in  advance  that  the  factors 
should only take on a discrete set of values [58]. In this case, we might try to fit a sum of rank-one
matrices that are bounded in $\ell_\infty$ norm rather than in $\ell_2$ norm. Another prior that commonly arises
in practice is that the factors are non-negative (i.e., in non-negative matrix factorization). These
and  other  priors  on  the  basic  rank-one  summands  induce  different  norms  on  low-rank  models  than 
the  standard  nuclear  norm  [27],  and  may  be  better  suited  to  specific  applications. 

2.3  Background  on  Tangent  and  Normal  Cones 

In  order  to  properly  state  our  results,  we  recall  some  basic  concepts  from  convex  analysis.  A  convex 
set  C  is  a  cone  if  it  is  closed  under  positive  linear  combinations.  The  polar  C*  of  a  cone  C  is  the 
cone 

$$C^* = \{x \in \mathbb{R}^p : \langle x, z \rangle \le 0\ \ \forall z \in C\}.$$

Given some nonzero $x \in \mathbb{R}^p$ we define the tangent cone at $x$ with respect to the scaled unit ball
$\|x\|_{\mathcal{A}}\,\mathrm{conv}(\mathcal{A})$ as

$$T_{\mathcal{A}}(x) = \mathrm{cone}\{z - x : \|z\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}\}. \tag{8}$$

The cone $T_{\mathcal{A}}(x)$ is equal to the set of descent directions of the atomic norm $\|\cdot\|_{\mathcal{A}}$ at the point $x$,
i.e., the set of all directions $d$ such that the directional derivative is negative.

The normal cone $N_{\mathcal{A}}(x)$ at $x$ with respect to the scaled unit ball $\|x\|_{\mathcal{A}}\,\mathrm{conv}(\mathcal{A})$ is defined to be
the set of all directions $s$ that form obtuse angles with every descent direction of the atomic norm
$\|\cdot\|_{\mathcal{A}}$ at the point $x$:

$$N_{\mathcal{A}}(x) = \{s : \langle s, z - x \rangle \le 0\ \ \forall z \ \text{s.t.}\ \|z\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}\}. \tag{9}$$

The normal cone is equal to the set of all hyperplanes given by normal vectors $s$ that support the
scaled unit ball $\|x\|_{\mathcal{A}}\,\mathrm{conv}(\mathcal{A})$ at $x$. Observe that the polar cone of the tangent cone $T_{\mathcal{A}}(x)$ is the
normal cone $N_{\mathcal{A}}(x)$ and vice-versa. Moreover we have the following basic characterization

$$N_{\mathcal{A}}(x) = \mathrm{cone}(\partial \|x\|_{\mathcal{A}}),$$

which states that the normal cone $N_{\mathcal{A}}(x)$ is the conic hull of the subdifferential of the atomic norm
at $x$.

2.4  Recovery  Condition 

The  following  result  gives  a  characterization  of  the  favorable  underlying  geometry  required  for  exact 
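recovery. Let $\mathrm{null}(\Phi)$ denote the nullspace of the operator $\Phi$.

Proposition 2.1. We have that $\hat{x} = x^*$ is the unique optimal solution of (5) if and only if
$\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^*) = \{0\}$.

Proof. Eliminating the equality constraints in (5) we have the equivalent optimization problem

\[ \min_{d} \ \|x^* + d\|_{\mathcal{A}} \quad \text{s.t.} \quad d \in \mathrm{null}(\Phi). \]

Suppose $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^*) = \{0\}$. Since $\|x^* + d\|_{\mathcal{A}} \le \|x^*\|_{\mathcal{A}}$ implies $d \in T_{\mathcal{A}}(x^*)$, we have that
$\|x^* + d\|_{\mathcal{A}} > \|x^*\|_{\mathcal{A}}$ for all $d \in \mathrm{null}(\Phi) \setminus \{0\}$. Conversely $x^*$ is the unique optimal solution of (5)
if $\|x^* + d\|_{\mathcal{A}} > \|x^*\|_{\mathcal{A}}$ for all $d \in \mathrm{null}(\Phi) \setminus \{0\}$, which implies that $d \notin T_{\mathcal{A}}(x^*)$. □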

Proposition  2.1  asserts  that  the  atomic  norm  heuristic  succeeds  if  the  nullspace  of  the  sampling 
operator does not intersect the tangent cone $T_{\mathcal{A}}(x^*)$ at $x^*$. In Section 3 we provide a characterization
of  tangent  cones  that  determines  the  number  of  Gaussian  measurements  required  to  guarantee  such 
an  empty  intersection. 
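The condition of Proposition 2.1 can be checked by hand in a very small case. The sketch below is an illustration only, under assumptions that are not part of the paper: the ambient space is $\mathbb{R}^2$, the atoms are $\pm e_1, \pm e_2$ (so the atomic norm is the $\ell_1$ norm), $x^* = (1, 0)$, and the measurement map is a single row $[a \ b]$. In that setting $T_{\mathcal{A}}(x^*) = \{d : d_1 + |d_2| \le 0\}$ and the nullspace is the line spanned by $(-b, a)$. All function names are hypothetical.

```python
def in_tangent_cone(d, tol=1e-12):
    """Membership in T_A(x*) for x* = (1, 0) under the l1 norm: d1 + |d2| <= 0."""
    return d[0] + abs(d[1]) <= tol


def condition_holds(a, b):
    """True if null([a b]) meets the tangent cone only at the origin."""
    if a == 0 and b == 0:
        raise ValueError("the measurement row [a b] must be nonzero")
    # null([a b]) is the line spanned by (-b, a); test both directions on it.
    return not in_tangent_cone((-b, a)) and not in_tangent_cone((b, -a))


def unique_on_grid(a, b, steps=2001, span=5.0):
    """Brute force: is ||x* + t(-b, a)||_1 > ||x*||_1 at every nonzero grid point t?"""
    for i in range(steps):
        t = -span + 2.0 * span * i / (steps - 1)
        if abs(t) < 1e-12:
            continue  # t = 0 is x* itself
        value = abs(1.0 - t * b) + abs(t * a)
        if value <= 1.0 + 1e-12:
            return False
    return True
```

In this toy case the cone test and the brute-force line search agree: recovery of $x^* = (1, 0)$ succeeds exactly when $|a| > |b|$. The grid search is only a sanity check over a finite range of $t$, not a proof.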

A tightening of this empty intersection condition can also be used to address the noisy approximation
problem. The following proposition characterizes when $x^*$ can be well-approximated using
the convex program (7).

Proposition 2.2. Suppose that we are given $n$ noisy measurements $y = \Phi x^* + \omega$ where $\|\omega\| \le \delta$,
and $\Phi : \mathbb{R}^p \to \mathbb{R}^n$. Let $\hat{x}$ denote an optimal solution of (7). Further suppose for all $z \in T_{\mathcal{A}}(x^*)$
that we have $\|\Phi z\| \ge \epsilon \|z\|$. Then $\|\hat{x} - x^*\| \le \frac{2\delta}{\epsilon}$.

Proof. The set of descent directions at $x^*$ with respect to the atomic norm ball is given by the
tangent cone $T_{\mathcal{A}}(x^*)$. The error vector $\hat{x} - x^*$ lies in $T_{\mathcal{A}}(x^*)$ because $\hat{x}$ is a minimal atomic norm
solution, and hence $\|\hat{x}\|_{\mathcal{A}} \le \|x^*\|_{\mathcal{A}}$. It follows by the triangle inequality that

\[ \|\Phi(\hat{x} - x^*)\| \le \|\Phi \hat{x} - y\| + \|\Phi x^* - y\| \le 2\delta. \tag{10} \]

By assumption we have that

\[ \|\Phi(\hat{x} - x^*)\| \ge \epsilon \|\hat{x} - x^*\|, \tag{11} \]

which allows us to conclude that $\|\hat{x} - x^*\| \le \frac{2\delta}{\epsilon}$. □

Therefore, we need only concern ourselves with estimating the minimum value of $\frac{\|\Phi z\|}{\|z\|}$ for nonzero
$z \in T_{\mathcal{A}}(x^*)$. We denote this quantity as the minimum gain of the measurement operator
$\Phi$ restricted to the cone $T_{\mathcal{A}}(x^*)$. In particular if this minimum gain is bounded away from zero,
then  the  atomic  norm  heuristic  also  provides  robust  recovery  when  we  have  access  to  noisy  linear 
measurements  of  x*. 
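As a small numerical illustration of the minimum gain and of the bound $\|\hat{x} - x^*\| \le 2\delta/\epsilon$ in Proposition 2.2, the sketch below reuses a toy setting that is an assumption of this example rather than something taken from the paper: the $\ell_1$ norm in $\mathbb{R}^2$, $x^* = (1, 0)$, and a single measurement row $[a \ b]$. Unit vectors in the tangent cone $\{d : d_1 + |d_2| \le 0\}$ have angles between $3\pi/4$ and $5\pi/4$, so the minimum gain can be approximated by sampling that arc. Function names are hypothetical.

```python
import math


def min_gain_on_tangent_cone(a, b, samples=10001):
    """Approximate min of |a*d1 + b*d2| over unit vectors d with d1 + |d2| <= 0."""
    best = float("inf")
    for i in range(samples):
        # Unit vectors in the cone have angles between 3*pi/4 and 5*pi/4.
        theta = 0.75 * math.pi + 0.5 * math.pi * i / (samples - 1)
        gain = abs(a * math.cos(theta) + b * math.sin(theta))
        best = min(best, gain)
    return best


def error_bound(delta, eps):
    """Bound of Proposition 2.2: ||x_hat - x*|| <= 2 * delta / eps."""
    if eps <= 0:
        raise ValueError("the minimum gain must be positive")
    return 2.0 * delta / eps
```

For the row $[1 \ \ 0.5]$ the sampled minimum gain is attained on a boundary ray of the cone, while for $[0.5 \ \ 1]$ the nullspace enters the cone and the sampled gain collapses toward zero, so no useful error bound is available.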

2.5  Why  Atomic  Norm? 

The  atomic  norm  induced  by  a  set  A  possesses  a  number  of  favorable  properties  that  are  useful 
for  recovering  “simple”  models  from  limited  linear  measurements.  The  key  point  to  note  from 
Section  2.4  is  that  the  smaller  the  tangent  cone  at  a  point  x*  with  respect  to  conv(A),  the  easier 
it is to satisfy the empty intersection condition of Proposition 2.1.

Based on this observation it is desirable that points in $\mathrm{conv}(\mathcal{A})$ with smaller tangent cones correspond
to simpler models, while points in $\mathrm{conv}(\mathcal{A})$ with larger tangent cones generally correspond
to more complicated models. The construction of $\mathrm{conv}(\mathcal{A})$ by taking the convex hull of $\mathcal{A}$ ensures
that this is the case. The extreme points of $\mathrm{conv}(\mathcal{A})$ correspond to the simplest models, i.e., those
models formed from a single element of $\mathcal{A}$. Further the low-dimensional faces of $\mathrm{conv}(\mathcal{A})$ consist
of those elements that are obtained by taking convex combinations of a few basic atoms from $\mathcal{A}$.
These are precisely the properties desired as points lying in these low-dimensional faces of $\mathrm{conv}(\mathcal{A})$
have smaller tangent cones than those lying on larger faces.

We  also  note  that  the  atomic  norm  is  in  some  sense  the  best  possible  convex  heuristic  for 
recovering  simple  models.  Specifically  the  unit  ball  of  any  convex  penalty  heuristic  must  satisfy  a 
key property: the tangent cone at any $a \in \mathcal{A}$ with respect to this unit ball must contain the vectors
$a' - a$ for all $a' \in \mathcal{A}$. The best convex penalty function is one in which the tangent cones at $a \in \mathcal{A}$
to the unit ball are the smallest possible, while still satisfying this requirement. This is because, as
described above, smaller tangent cones are more likely to satisfy the empty intersection condition
required for exact recovery. It is clear that the smallest such convex set is precisely $\mathrm{conv}(\mathcal{A})$, hence
implying  that  the  atomic  norm  is  the  best  convex  heuristic  for  recovering  simple  models. 

Our  reasons  for  proposing  the  atomic  norm  as  a  useful  convex  heuristic  are  quite  different  from 
previous justifications of the $\ell_1$ norm and the nuclear norm. In particular let $f : \mathbb{R}^p \to \mathbb{R}$ denote the
cardinality function that counts the number of nonzero entries of a vector. Then the $\ell_1$ norm is the
convex envelope of $f$ restricted to the unit ball of the $\ell_\infty$ norm, i.e., the best convex underestimator
of $f$ restricted to vectors in the $\ell_\infty$-norm ball. This view of the $\ell_1$ norm in relation to the function
$f$ is often given as a justification for its effectiveness in recovering sparse vectors. However if we
consider the convex envelope of $f$ restricted to the Euclidean norm ball, then we obtain a very
different convex function than the $\ell_1$ norm! With more general atomic sets, it may not be clear a
priori what the bounding set should be in deriving the convex envelope. In contrast the viewpoint
adopted in this paper leads to a natural, unambiguous construction of the $\ell_1$ norm and other general
atomic  norms.  Further  as  explained  above  it  is  the  favorable  facial  structure  of  the  atomic  norm 
ball  that  makes  the  atomic  norm  a  suitable  convex  heuristic  to  recover  simple  models,  and  this 
connection  is  transparent  in  the  definition  of  the  atomic  norm. 

3  Recovery  from  Generic  Measurements 

We  consider  the  question  of  using  the  convex  program  (5)  to  recover  “simple”  models  formed 
according to (1) from a generic measurement operator or map $\Phi : \mathbb{R}^p \to \mathbb{R}^n$. Specifically, we wish
to compute estimates on the number of measurements $n$ so that we have exact recovery using (5) for
most operators consisting of $n$ measurements. That is, the measure of $n$-measurement operators
for which recovery fails using (5) must be exponentially small. In order to conduct such an analysis
we study random Gaussian maps $\Phi$, in which the entries are independent and identically distributed
Gaussians. These measurement operators have the property that the nullspace $\mathrm{null}(\Phi)$ is uniformly
distributed among the set of all $(p - n)$-dimensional subspaces in $\mathbb{R}^p$. In particular we analyze when
such operators satisfy the conditions of Proposition 2.1 and Proposition 2.2 for exact recovery.

3.1  Recovery  Conditions  based  on  Gaussian  Width 

Proposition 2.1 requires that the nullspace of the measurement operator $\Phi$ must miss the tangent
cone $T_{\mathcal{A}}(x^*)$. Gordon [31] gave a solution to the problem of characterizing the probability that
a  random  subspace  (of  some  fixed  dimension)  distributed  uniformly  misses  a  cone.  We  begin  by 
defining  the  Gaussian  width  of  a  set,  which  plays  a  key  role  in  Gordon’s  analysis. 

Definition 3.1. The Gaussian width of a set $S \subset \mathbb{R}^p$ is defined as:

\[ w(S) := \mathbb{E}_g \Big[ \sup_{z \in S} g^T z \Big], \]

where $g \sim \mathcal{N}(0, I)$ is a vector of independent zero-mean unit-variance Gaussians.

Gordon characterized the likelihood that a random subspace misses a cone $C$ purely in terms of
the dimension of the subspace and the Gaussian width $w(C \cap \mathbb{S}^{p-1})$, where $\mathbb{S}^{p-1} \subset \mathbb{R}^p$ is the unit
sphere. Before describing Gordon's result formally, we introduce some notation. Let $\lambda_k$ denote
the expected length of a $k$-dimensional Gaussian random vector. By elementary integration, we
have that $\lambda_k = \sqrt{2}\,\Gamma(\frac{k+1}{2}) / \Gamma(\frac{k}{2})$. Further by induction one can show that $\lambda_k$ is tightly bounded as

\[ \frac{k}{\sqrt{k+1}} \le \lambda_k \le \sqrt{k}. \]
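The closed form $\lambda_k = \sqrt{2}\,\Gamma(\frac{k+1}{2})/\Gamma(\frac{k}{2})$ and the sandwich $k/\sqrt{k+1} \le \lambda_k \le \sqrt{k}$ are easy to check numerically. The following is a minimal sketch; the function name is hypothetical, and the log-gamma function is used only to avoid overflow for large $k$.

```python
import math


def expected_gaussian_norm(k):
    """lambda_k = sqrt(2) * Gamma((k + 1) / 2) / Gamma(k / 2), via log-gamma."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    log_ratio = math.lgamma((k + 1) / 2.0) - math.lgamma(k / 2.0)
    return math.sqrt(2.0) * math.exp(log_ratio)


def bounds_hold(k):
    """Check k / sqrt(k + 1) <= lambda_k <= sqrt(k)."""
    lam = expected_gaussian_norm(k)
    return k / math.sqrt(k + 1.0) <= lam <= math.sqrt(k)
```

For example, `expected_gaussian_norm(1)` evaluates the mean absolute value of a standard Gaussian, $\sqrt{2/\pi}$, and the gap between the two bounds shrinks like $1/(2\sqrt{k})$ as $k$ grows.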

The  main  idea  underlying  Gordon’s  theorem  is  a  bound  on  the  minimum  gain  of  an  operator 
restricted to a set. Specifically, recall that $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^*) = \{0\}$ is the condition required for
recovery by Proposition 2.1. Thus if we have that the minimum gain of $\Phi$ restricted to vectors in
the set $T_{\mathcal{A}}(x^*) \cap \mathbb{S}^{p-1}$ is bounded away from zero, then it is clear that $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^*) = \{0\}$. We
refer to such minimum gains restricted to a subset of the sphere as restricted minimum singular
values, and the following theorem of Gordon gives a bound on these quantities [31]:

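Theorem 3.2 (Gordon's Minimum Restricted Singular Values Theorem). Let $\Omega$ be a closed subset
of $\mathbb{S}^{p-1}$. Let $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ be a random map with i.i.d. zero-mean Gaussian entries having variance
one. Then provided that $\lambda_k \ge w(\Omega) + \epsilon$, we have

\[ \mathbb{P}\Big[ \min_{z \in \Omega} \|\Phi z\|_2 \ge \epsilon \Big] \ge 1 - [\ldots]\,\exp\Big( -[\ldots]\,(\lambda_k - w(\Omega) - \epsilon)^2 \Big). \tag{12} \]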

This  theorem  is  not  explicitly  stated  as  such  in  [31]  but  the  proof  follows  directly  as  a  result  of 
Gordon’s  arguments.  Theorem  3.2  allows  us  to  characterize  exact  recovery  in  the  noise-free  case 
using  the  convex  program  (5),  and  robust  recovery  in  the  noisy  case  using  the  convex  program  (7). 
Specifically,  we  consider  the  number  of  measurements  required  for  exact  or  robust  recovery  when 
the measurement map $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ consists of i.i.d. zero-mean Gaussian entries having variance
$1/n$. The normalization of the variance ensures that the columns of $\Phi$ are approximately unit-norm,
and is necessary in order to properly define a signal-to-noise ratio. The following corollary
summarizes  the  main  results  of  interest  in  our  setting: 

Corollary 3.3. Let $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ be a random map with i.i.d. zero-mean Gaussian entries having
variance $1/n$. Further let $\Omega = T_{\mathcal{A}}(x^*) \cap \mathbb{S}^{p-1}$ denote the spherical part of the tangent cone $T_{\mathcal{A}}(x^*)$.

1. Suppose that we have measurements $y = \Phi x^*$, and we solve the convex program (5). Then
$x^*$ is the unique optimum of (5) with high probability provided that

\[ n \ge w(\Omega)^2 + O(1). \]

2. Suppose that we have noisy measurements $y = \Phi x^* + \omega$, with the noise $\omega$ bounded as $\|\omega\| \le \delta$,
and that we solve the convex program (7). Letting $\hat{x}$ denote the optimal solution of (7), we
have that $\|x^* - \hat{x}\| \le \frac{2\delta}{\epsilon}$ with high probability provided

\[ n \ge \frac{w(\Omega)^2}{(1 - \epsilon)^2} + O(1). \]


Proof. The two results are simple consequences of Theorem 3.2:

1. The first part follows by setting $\epsilon = 0$ in Theorem 3.2.

2. For $\epsilon \in (0, 1)$ we have from Theorem 3.2 that

\[ \|\Phi(z)\| \ge \epsilon \|z\| \tag{13} \]

for all $z \in T_{\mathcal{A}}(x^*)$ with high probability. Therefore we can apply Proposition 2.2 to conclude
that $\|\hat{x} - x^*\| \le \frac{2\delta}{\epsilon}$ with high probability, provided that $n \ge \frac{w(\Omega)^2}{(1-\epsilon)^2} + O(1)$.

□


Gordon's theorem thus provides a simple characterization of the number of measurements required
for reconstruction with the atomic norm. Indeed the Gaussian width of $\Omega = T_{\mathcal{A}}(x^*) \cap \mathbb{S}^{p-1}$
is the only quantity that we need to compute in order to obtain bounds for both exact and robust
recovery. Unfortunately it is in general not easy to compute Gaussian widths. Rudelson and Vershynin
[56] have worked out Gaussian widths for the special case of tangent cones at sparse vectors
on the boundary of the $\ell_1$ ball, and derived results for sparse vector recovery using $\ell_1$ minimization
that improve upon previous results. In the next section we give various well-known properties of
the  Gaussian  width  that  are  useful  in  some  computations.  In  Section  3.3  we  discuss  a  new  approach 
to  width  computations  that  gives  near-optimal  recovery  bounds  in  a  variety  of  settings. 
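Although closed-form Gaussian widths are hard to obtain, the definition $w(S) = \mathbb{E}_g[\sup_{z \in S} g^T z]$ can be estimated directly by simulation when $S$ is a small finite set. The sketch below is a plain Monte Carlo estimator and is not the technique developed in this paper; the function names, trial counts, and the choice of the signed standard basis as the test set are assumptions made for illustration.

```python
import math
import random


def gaussian_width(points, trials=5000, seed=0):
    """Monte Carlo estimate of w(S) = E_g[max_{z in S} <g, z>] for a finite set S."""
    rng = random.Random(seed)
    p = len(points[0])
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        total += max(sum(gi * zi for gi, zi in zip(g, z)) for z in points)
    return total / trials


def signed_basis(p):
    """The atoms {+e_i, -e_i}, i.e. the vertices of the l1 ball in R^p."""
    atoms = []
    for i in range(p):
        for sign in (1.0, -1.0):
            atoms.append([sign if j == i else 0.0 for j in range(p)])
    return atoms
```

For the signed basis the width is the expected largest absolute coordinate of $g$, which is at most $\sqrt{2\log(2p)}$ by the standard bound on the maximum of $2p$ Gaussians. The estimator does not apply directly to sets such as $T_{\mathcal{A}}(x^*) \cap \mathbb{S}^{p-1}$, where the supremum is itself an optimization problem.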


3.2  Properties  of  Gaussian  Width 

In  this  section  we  record  several  elementary  properties  of  the  Gaussian  width  that  are  useful  for 
computation.  We  begin  by  making  some  basic  observations,  which  are  easily  derived. 

First we note that the width is monotonic. If $S_1 \subseteq S_2 \subseteq \mathbb{R}^p$, then it is clear from the definition
of the Gaussian width that

\[ w(S_1) \le w(S_2). \]

Second we note that if we have a set $S \subseteq \mathbb{R}^p$, then the Gaussian width of $S$ is equal to the Gaussian
width of the convex hull of $S$:

\[ w(S) = w(\mathrm{conv}(S)). \]

This result follows from the basic fact in convex analysis that the maximum of a convex function
over a convex set is achieved at an extreme point of the convex set. Third if $V \subseteq \mathbb{R}^p$ is a subspace
in $\mathbb{R}^p$, then we have that

\[ w(V \cap \mathbb{S}^{p-1}) = \sqrt{\dim(V)}, \]

which follows from standard results on random Gaussians. This result also agrees with the intuition
that a random Gaussian map $\Phi$ misses a $k$-dimensional subspace with high probability as long as
the number of measurements $n = p - \dim(\mathrm{null}(\Phi))$ is at least $k + 1$. Finally, if a cone $S \subseteq \mathbb{R}^p$ is such that $S = S_1 \oplus S_2$, where $S_1 \subseteq \mathbb{R}^p$ is a
$k$-dimensional cone, $S_2 \subseteq \mathbb{R}^p$ is a $(p - k)$-dimensional cone that is orthogonal to $S_1$, and $\oplus$ denotes
the direct sum operation, then the width can be decomposed as follows:

\[ w(S \cap \mathbb{S}^{p-1})^2 \le w(S_1 \cap \mathbb{S}^{p-1})^2 + w(S_2 \cap \mathbb{S}^{p-1})^2. \]

These  observations  are  useful  in  a  variety  of  situations.  For  example  a  width  computation  that 
frequently arises is one in which $S = S_1 \oplus S_2$ as described above, with $S_1$ being a $k$-dimensional
subspace. It follows that the width of $S \cap \mathbb{S}^{p-1}$ is bounded as

\[ w(S \cap \mathbb{S}^{p-1})^2 \le k + w(S_2 \cap \mathbb{S}^{p-1})^2. \]

These basic operations involving Gaussian widths were used by Rudelson and Vershynin [56] to
compute the Gaussian widths of tangent cones at sparse vectors with respect to the $\ell_1$ norm ball.
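The convex hull property $w(S) = w(\mathrm{conv}(S))$ rests on the fact that a linear functional never takes a larger value on a convex combination of points than on the best of those points. The sketch below checks this pointwise for a finite set $S$ using random Gaussian directions and random convex combinations. It is an empirical sanity check under assumed helper names, not a proof.

```python
import random


def support(points, g):
    """Support function value max_{z in S} <g, z> of a finite set S."""
    return max(sum(gi * zi for gi, zi in zip(g, z)) for z in points)


def random_convex_combination(points, rng):
    """A random point of conv(S), built from positive weights that sum to one."""
    weights = [rng.random() + 1e-12 for _ in points]
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * z[j] for w, z in zip(weights, points)) / total
            for j in range(dim)]


def hull_never_exceeds_vertices(points, trials=1000, seed=0):
    """Check <g, y> <= max_{z in S} <g, z> for random g and random y in conv(S)."""
    rng = random.Random(seed)
    dim = len(points[0])
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = random_convex_combination(points, rng)
        value = sum(gi * yi for gi, yi in zip(g, y))
        if value > support(points, g) + 1e-9:
            return False
    return True
```

Because the inequality holds for every realization of $g$, taking expectations gives $w(\mathrm{conv}(S)) \le w(S)$, and monotonicity gives the reverse inequality.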

Another tool for computing Gaussian widths is based on Dudley's inequality [25, 42], which
bounds the width of a set in terms of the covering number of the set at all scales.

Definition 3.4. Let $S$ be an arbitrary compact subset of $\mathbb{R}^p$. The covering number of $S$ in the
Euclidean norm at resolution $\epsilon$ is the smallest number, $\mathfrak{N}(S, \epsilon)$, such that $\mathfrak{N}(S, \epsilon)$ Euclidean balls
of radius $\epsilon$ cover $S$.

Theorem 3.5 (Dudley's Inequality). Let $S$ be an arbitrary compact subset of $\mathbb{R}^p$, and let $g$ be a
random vector with i.i.d. zero-mean, unit-variance Gaussian entries. Then

\[ w(S) \le 24 \int_0^\infty \sqrt{\log(\mathfrak{N}(S, \epsilon))}\, d\epsilon. \tag{14} \]

We note here that a weak converse to Dudley's inequality can be obtained via Sudakov's Minoration
[42] by using the covering number for just a single scale. Specifically, we have the following
lower bound on the Gaussian width of a compact subset $S \subset \mathbb{R}^p$ for any $\epsilon > 0$:

\[ w(S) \ge c\,\epsilon \sqrt{\log(\mathfrak{N}(S, \epsilon))}. \]

Here $c > 0$ is some universal constant.

Although Dudley's inequality can be applied quite generally, estimating covering numbers is difficult
in most instances. There are a few simple characterizations available for spheres and Sobolev
spaces, and some tractable arguments based on Maurey's empirical method [42]. However it is not
evident how to compute these numbers for general convex cones. Also, in order to apply Dudley's
inequality we need to estimate the covering number at all scales. Further Dudley's inequality can
be quite loose in its estimates, and it often introduces extraneous polylogarithmic factors. In the
next section we describe a new mechanism for estimating Gaussian widths, which provides near-optimal
guarantees for recovery of sparse vectors and low-rank matrices, as well as for several of
the recovery problems discussed in Section 3.4.

3.3  New  Results  on  Gaussian  Width 

We  discuss  a  new  dual  framework  for  computing  Gaussian  widths.  In  particular  we  express  the 
Gaussian width of a cone in terms of the dual of the cone. To be fully general let $C$ be a non-empty
convex cone in $\mathbb{R}^p$, and let $C^*$ denote the polar of $C$. We can then upper bound the Gaussian width
of any cone $C$ in terms of the polar cone $C^*$:

Proposition 3.6. Let $C$ be any non-empty convex cone in $\mathbb{R}^p$, and let $g \sim \mathcal{N}(0, I)$ be a random
Gaussian vector. Then we have the following bound:

\[ w(C \cap \mathbb{S}^{p-1}) \le \mathbb{E}_g[\mathrm{dist}(g, C^*)], \]

where dist here denotes the Euclidean distance between a point and a set.

The proof is given in Appendix A, and it follows from an appeal to convex duality. Proposition 3.6
is more or less a restatement of the fact that the support function of a convex cone is
equal to the distance to its polar cone. As it is the square of the Gaussian width that is of interest
to us (see Corollary 3.3), it is often useful to apply Jensen's inequality to make the following
approximation:

\[ \mathbb{E}_g[\mathrm{dist}(g, C^*)]^2 \le \mathbb{E}_g[\mathrm{dist}(g, C^*)^2]. \tag{15} \]
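The dual bound and the Jensen step can be seen concretely for the nonnegative orthant, a cone chosen here purely for illustration. Its polar is the nonpositive orthant, so $\mathrm{dist}(g, C^*)$ is the Euclidean norm of the positive part of $g$, and $\mathbb{E}_g[\mathrm{dist}(g, C^*)^2] = p/2$ because each coordinate contributes $\mathbb{E}[\max(g_i, 0)^2] = 1/2$. The Monte Carlo sketch below uses hypothetical function names and estimates the three quantities involved.

```python
import math
import random


def dist_to_polar_of_orthant(g):
    """Distance from g to the nonpositive orthant (polar of the nonnegative orthant)."""
    return math.sqrt(sum(max(gi, 0.0) ** 2 for gi in g))


def support_on_orthant(g):
    """sup of <g, z> over unit vectors z with nonnegative entries."""
    d = dist_to_polar_of_orthant(g)
    # If g has no positive entry, the best unit vector sits on the largest coordinate.
    return d if d > 0.0 else max(g)


def orthant_estimates(p, trials=20000, seed=0):
    """Monte Carlo estimates of (width, E[dist], E[dist^2]) for the orthant in R^p."""
    rng = random.Random(seed)
    width = mean_dist = mean_dist_sq = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        d = dist_to_polar_of_orthant(g)
        width += support_on_orthant(g)
        mean_dist += d
        mean_dist_sq += d * d
    return width / trials, mean_dist / trials, mean_dist_sq / trials
```

The first returned value never exceeds the second (the bound of Proposition 3.6 holds sample by sample here), the square of the second never exceeds the third (Jensen), and the third concentrates around $p/2$.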

The  inspiration  for  our  characterization  in  Proposition  3.6  of  the  width  of  a  cone  in  terms  of 
the  expected  distance  to  its  dual  came  from  the  work  of  Stojnic  [59],  who  used  linear  programming 
duality  to  construct  Gaussian-width-based  estimates  for  analyzing  recovery  in  sparse  reconstruction 
problems. Specifically, Stojnic's relatively simple approach recovered well-known phase transitions
in sparse signal recovery [23], and also generalized to block sparse signals and other forms of
structured sparsity.

This  new  dual  characterization  yields  a  number  of  useful  bounds  on  the  Gaussian  width,  which 
we  describe  here.  In  the  following  section  we  use  these  bounds  to  derive  new  recovery  results.  The 
first  result  is  a  bound  on  the  Gaussian  width  of  a  cone  in  terms  of  the  Gaussian  width  of  its  polar. 

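Lemma 3.7. Let $C \subseteq \mathbb{R}^p$ be a non-empty closed, convex cone. Then we have that

\[ w(C \cap \mathbb{S}^{p-1})^2 + w(C^* \cap \mathbb{S}^{p-1})^2 \le p. \]

Proof. Combining Proposition 3.6 and (15), we have that

\[ w(C \cap \mathbb{S}^{p-1})^2 \le \mathbb{E}_g[\mathrm{dist}(g, C^*)^2], \]

where as before $g \sim \mathcal{N}(0, I)$. For any $z \in \mathbb{R}^p$ we let $\Pi_C(z) = \arg\inf_{u \in C} \|z - u\|$ denote the projection
of $z$ onto $C$. From standard results in convex analysis [55], we note that one can decompose any
$z \in \mathbb{R}^p$ into orthogonal components as follows:

\[ z = \Pi_C(z) + \Pi_{C^*}(z), \qquad \langle \Pi_C(z), \Pi_{C^*}(z) \rangle = 0. \]

Therefore we have the following sequence of bounds:

\[
\begin{aligned}
w(C \cap \mathbb{S}^{p-1})^2 &\le \mathbb{E}_g[\mathrm{dist}(g, C^*)^2] \\
&= \mathbb{E}_g[\|\Pi_C(g)\|^2] \\
&= \mathbb{E}_g[\|g\|^2 - \|\Pi_{C^*}(g)\|^2] \\
&= p - \mathbb{E}_g[\|\Pi_{C^*}(g)\|^2] \\
&= p - \mathbb{E}_g[\mathrm{dist}(g, C)^2] \\
&\le p - w(C^* \cap \mathbb{S}^{p-1})^2.
\end{aligned}
\]

□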
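The orthogonal decomposition $z = \Pi_C(z) + \Pi_{C^*}(z)$ used in this proof can be verified on a cone with a simple closed-form projection. The sketch below uses the second-order (Lorentz) cone $L = \{(x, t) : \|x\|_2 \le t\}$, which is an example chosen for illustration and does not appear in the proof itself. Its polar is $-L$, and the projection formula used is the standard one for this cone. Function names are hypothetical.

```python
import math


def project_lorentz(z):
    """Euclidean projection onto L = {(x, t) : ||x||_2 <= t}; the last entry of z is t."""
    x, t = z[:-1], z[-1]
    nx = math.sqrt(sum(v * v for v in x))
    if nx <= t:
        return list(z)              # already inside the cone
    if nx <= -t:
        return [0.0] * len(z)       # inside the polar cone, projects to the origin
    scale = (nx + t) / 2.0          # here nx > |t| >= 0, so nx is nonzero
    return [scale * v / nx for v in x] + [scale]


def project_polar(z):
    """Projection onto the polar cone -L, using Pi_{-L}(z) = -Pi_L(-z)."""
    return [-v for v in project_lorentz([-v for v in z])]


def decomposition_residuals(z):
    """Return (||z - Pi_C(z) - Pi_C*(z)||, <Pi_C(z), Pi_C*(z)>)."""
    a, b = project_lorentz(z), project_polar(z)
    gap = math.sqrt(sum((zi - ai - bi) ** 2 for zi, ai, bi in zip(z, a, b)))
    inner = sum(ai * bi for ai, bi in zip(a, b))
    return gap, inner
```

Both residuals vanish up to rounding, so $\|z\|^2 = \|\Pi_C(z)\|^2 + \|\Pi_{C^*}(z)\|^2$, which is the identity that turns $\mathbb{E}_g[\|g\|^2] = p$ into the bound of the lemma.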

In  many  recovery  problems  one  is  interested  in  computing  the  width  of  a  self-dual  cone.  For 
such  cones  the  following  corollary  to  Lemma  3.7  gives  a  simple  solution: 

Corollary 3.8. Let $C \subseteq \mathbb{R}^p$ be a self-dual cone, i.e., $C = -C^*$. Then we have that

\[ w(C \cap \mathbb{S}^{p-1})^2 \le \frac{p}{2}. \]

Proof. The proof follows directly from Lemma 3.7 as $w(C \cap \mathbb{S}^{p-1})^2 = w(C^* \cap \mathbb{S}^{p-1})^2$. □

Our next bound for the width of a cone $C$ is based on the volume of its polar $C^* \cap \mathbb{S}^{p-1}$. The
volume  of  a  measurable  subset  of  the  sphere  is  the  fraction  of  the  sphere  covered  by  the  subset. 
Thus  it  is  a  quantity  between  zero  and  one. 

Theorem 3.9 (Gaussian width from volume of the polar). Let $C \subseteq \mathbb{R}^p$ be any closed, convex, solid
cone, and suppose that its polar $C^*$ is such that $C^* \cap \mathbb{S}^{p-1}$ has a volume of $\Theta \in [0, 1]$. Then for
$p \ge 9$ we have that

\[ w(C \cap \mathbb{S}^{p-1}) \le 3 \sqrt{\log\Big(\frac{4}{\Theta}\Big)}. \]

The proof of this theorem is given in Appendix B. The main property that we appeal to in
the proof is Gaussian isoperimetry. In particular there is a formal sense in which a spherical cap¹
is the “extremal case” among all subsets of the sphere with a given volume $\Theta$. Other than this
observation the proof mainly involves a sequence of integral calculations.

Note that if we are given a specification of a cone $C \subseteq \mathbb{R}^p$ in terms of a membership oracle,
it is possible to efficiently obtain good numerical estimates of the volume of $C \cap \mathbb{S}^{p-1}$ [26]. Moreover,
simple symmetry arguments often give relatively accurate estimates of these volumes. Such
estimates  can  then  be  plugged  into  Theorem  3.9  to  yield  bounds  on  the  width. 
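To make the notion of volume concrete, the following sketch estimates the fraction of the sphere covered by a cone given only a membership oracle. It uses naive Monte Carlo sampling, which is an assumption of this example and not the algorithm of [26]; it is only practical when the volume is not extremely small. Since cone membership is invariant to positive scaling, Gaussian samples can be tested directly without normalizing them. The function names are hypothetical.

```python
import random


def cone_volume(membership, p, trials=20000, seed=0):
    """Estimate the fraction of S^{p-1} covered by a cone from a membership oracle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # The direction of a standard Gaussian vector is uniform on the sphere.
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        if membership(g):
            hits += 1
    return hits / trials


def in_nonnegative_orthant(g):
    """Membership oracle for the nonnegative orthant."""
    return all(gi >= 0.0 for gi in g)
```

By symmetry the nonnegative orthant in $\mathbb{R}^p$ covers a fraction $2^{-p}$ of the sphere, which gives a simple check on the estimator for small $p$.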

3.4  New  Recovery  Bounds 

We  use  the  bounds  derived  in  the  last  section  to  obtain  new  recovery  results.  First  using  the  dual 
characterization  of  the  Gaussian  width  in  Proposition  3.6,  we  are  able  to  obtain  sharp  bounds  on  the 
number  of  measurements  required  for  recovering  sparse  vectors  and  low-rank  matrices  from  random 
Gaussian measurements using convex optimization (i.e., $\ell_1$-norm and nuclear norm minimization).

Proposition 3.10. Let $x^* \in \mathbb{R}^p$ be an $s$-sparse vector. Letting $\mathcal{A}$ denote the set of unit-Euclidean-norm
one-sparse vectors, we have that

\[ w(T_{\mathcal{A}}(x^*))^2 \le (2s + 1) \log(p - s). \]

Thus $(2s + 1) \log(p - s)$ random Gaussian measurements suffice to recover $x^*$ via $\ell_1$ norm minimization
with high probability.

Proposition 3.11. Let $x^*$ be an $m_1 \times m_2$ rank-$r$ matrix with $m_1 \le m_2$. Letting $\mathcal{A}$ denote the set
of unit-Euclidean-norm rank-one matrices, we have that

\[ w(T_{\mathcal{A}}(x^*))^2 \le 3r(m_1 + m_2 - r) + 2(m_1 - r - r^2). \]

Thus $3r(m_1 + m_2 - r) + 2(m_1 - r - r^2)$ random Gaussian measurements suffice to recover $x^*$ via
nuclear norm minimization with high probability.

The  proofs  of  these  propositions  are  given  in  Appendix  C.  The  number  of  measurements 
required  by  these  bounds  is  on  the  same  order  as  previously  known  results  [21,  12],  but  with 
improved  constants.  We  also  note  that  we  have  robust  recovery  at  these  thresholds.  Further  these 
results  do  not  require  explicit  recourse  to  any  type  of  restricted  isometry  property  [12],  and  the 
proofs  are  simple  and  based  on  elementary  integrals. 
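To get a feel for these thresholds, the sketch below evaluates the two bounds as stated in Propositions 3.10 and 3.11, namely $(2s+1)\log(p-s)$ and $3r(m_1+m_2-r) + 2(m_1-r-r^2)$, so that they can be compared with the ambient dimensions $p$ and $m_1 m_2$. The function names and the example sizes are illustrative assumptions, and the $O(1)$ terms of Corollary 3.3 are ignored.

```python
import math


def sparse_measurements(p, s):
    """Bound of Proposition 3.10: (2s + 1) * log(p - s)."""
    if not 0 < s < p:
        raise ValueError("need 0 < s < p")
    return (2 * s + 1) * math.log(p - s)


def low_rank_measurements(m1, m2, r):
    """Bound of Proposition 3.11: 3r(m1 + m2 - r) + 2(m1 - r - r^2), with m1 <= m2."""
    if not (0 < r <= m1 <= m2):
        raise ValueError("need 0 < r <= m1 <= m2")
    return 3 * r * (m1 + m2 - r) + 2 * (m1 - r - r * r)
```

For instance, a 10-sparse vector in dimension 1000 and a rank-2 matrix of size 50 by 50 both lead to measurement counts well below their ambient dimensions of 1000 and 2500.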

Next  we  obtain  a  set  of  recovery  results  by  appealing  to  Corollary  3.8  on  the  width  of  a  self-dual 
cone.  These  examples  correspond  to  the  recovery  of  individual  atoms  (i.e.,  the  extreme  points  of 
the set $\mathrm{conv}(\mathcal{A})$), although the same machinery is applicable in principle to estimate the number of
measurements required to recover models formed as sums of a few atoms (i.e., points lying on low-dimensional
faces of $\mathrm{conv}(\mathcal{A})$). We first obtain a well-known result on the number of measurements
required for recovering sign-vectors via $\ell_\infty$ norm minimization.

Proposition 3.12. Let $x^* \in \{-1, +1\}^p$ be a sign-vector in $\mathbb{R}^p$, and let $\mathcal{A}$ be the set of all such
sign-vectors. Then we have that

\[ w(T_{\mathcal{A}}(x^*))^2 \le \frac{p}{2}. \]

Thus $\frac{p}{2}$ random Gaussian measurements suffice to recover $x^*$ via $\ell_\infty$-norm minimization with high
probability.

¹ A spherical cap is a subset of the sphere obtained by intersecting the sphere $\mathbb{S}^{p-1}$ with a halfspace.

Proof. The tangent cone at any signed vector $x^*$ with respect to the $\ell_\infty$ ball is a rotation of the
nonnegative orthant. Thus we only need to compute the Gaussian width of an orthant in $\mathbb{R}^p$. As
the orthant is self-dual, we have the required bound from Corollary 3.8. □

This result agrees with previously computed bounds in [45, 24], which relied on a more complicated
combinatorial argument. Next we compute the number of measurements required to recover
orthogonal matrices via spectral-norm minimization (see Section 2.2). Let $O(m)$ denote the group
of $m \times m$ orthogonal matrices, viewed as a subgroup of the set of nonsingular matrices in $\mathbb{R}^{m \times m}$.


Proposition 3.13. Let $x^* \in \mathbb{R}^{m \times m}$ be an orthogonal matrix, and let $\mathcal{A}$ be the set of all orthogonal
matrices. Then we have that

\[ w(T_{\mathcal{A}}(x^*))^2 \le \frac{3m^2 - m}{4}. \]

Thus $\frac{3m^2 - m}{4}$ random Gaussian measurements suffice to recover $x^*$ via spectral-norm minimization
with high probability.


Proof. Due to the symmetry of the orthogonal group, it suffices to consider the tangent cone at the
identity matrix $I$ with respect to the spectral norm ball. Recall that the spectral norm ball is the
convex hull of the orthogonal matrices. Therefore the tangent space at the identity matrix with
respect to the orthogonal group $O(m)$ is a subset of the tangent cone $T_{\mathcal{A}}(I)$. It is well-known that
this tangent space is the set of all $m \times m$ skew-symmetric matrices. Thus we only need to compute
the component $S$ of $T_{\mathcal{A}}(I)$ that lies in the subspace of symmetric matrices:

\[
\begin{aligned}
S &= \mathrm{cone}\{M - I : \|M\|_{\mathcal{A}} \le 1, \ M \text{ symmetric}\} \\
&= \mathrm{cone}\{U D U^T - U U^T : \|D\|_{\mathcal{A}} \le 1, \ D \text{ diagonal}, \ U \in O(m)\} \\
&= \mathrm{cone}\{U (D - I) U^T : \|D\|_{\mathcal{A}} \le 1, \ D \text{ diagonal}, \ U \in O(m)\} \\
&= -\mathrm{PSD}_m.
\end{aligned}
\]

Here $\mathrm{PSD}_m$ denotes the set of $m \times m$ symmetric positive-semidefinite matrices. As this cone is
self-dual, we can apply Corollary 3.8 in conjunction with the observations in Section 3.2 to conclude
that

\[ w(T_{\mathcal{A}}(I))^2 \le \binom{m}{2} + \frac{1}{2}\binom{m+1}{2} = \frac{3m^2 - m}{4}. \]

□


We note that the number of degrees of freedom in an $m \times m$ orthogonal matrix (i.e., the
dimension of the manifold of orthogonal matrices) is $\frac{m(m-1)}{2}$. Proposition 3.12 and Proposition 3.13
point to the importance of obtaining recovery bounds with sharp constants. Larger constants in
either result would imply that the number of measurements required exceeds the ambient dimension
of the underlying $x^*$. In these and many other cases of interest Gaussian width arguments not only
give  order-optimal  recovery  results,  but  also  provide  precise  constants  that  result  in  sharp  recovery 
thresholds. 
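The arithmetic behind the orthogonal-matrix bound can be checked exactly. The sketch below assumes the reading of the proof of Proposition 3.13 in which the bound is the dimension of the skew-symmetric matrices, $\binom{m}{2}$, plus half the dimension of the symmetric matrices, $\frac{1}{2}\binom{m+1}{2}$, the latter coming from Corollary 3.8 applied to the self-dual cone of positive-semidefinite matrices. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def orthogonal_width_bound(m):
    """dim(skew-symmetric) + half of dim(symmetric), as an exact fraction."""
    return Fraction(comb(m, 2)) + Fraction(comb(m + 1, 2), 2)


def degrees_of_freedom(m):
    """Dimension of the manifold of m x m orthogonal matrices, m(m - 1)/2."""
    return Fraction(m * (m - 1), 2)
```

The bound equals $(3m^2 - m)/4$ for every $m$, and it sits between the $m(m-1)/2$ degrees of freedom and the ambient dimension $m^2$, which is why a larger leading constant would make the result vacuous.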

Finally we give a third set of recovery results that appeal to the Gaussian width bound of Theorem 3.9.
The following measurement bound applies to cases when $\mathrm{conv}(\mathcal{A})$ is a symmetric polytope
(roughly  speaking,  all  the  vertices  are  “equivalent”),  and  is  a  simple  corollary  of  Theorem  3.9. 

Corollary 3.14. Suppose that the set $\mathcal{A}$ is a finite collection of $m$ points, with the convex hull
$\mathrm{conv}(\mathcal{A})$ being a vertex-transitive polytope [64] whose vertices are the points in $\mathcal{A}$. Using the convex
program (5) we have that $9 \log(m)$ random Gaussian measurements suffice, with high probability,
for exact recovery of a point in $\mathcal{A}$, i.e., a vertex of $\mathrm{conv}(\mathcal{A})$.

Proof. We recall the basic fact from convex analysis that the normal cones at the vertices of a
convex polytope in $\mathbb{R}^p$ provide a partitioning of $\mathbb{R}^p$. As $\mathrm{conv}(\mathcal{A})$ is a vertex-transitive polytope,
the normal cone at a vertex covers $\frac{1}{m}$ fraction of $\mathbb{R}^p$. Applying Theorem 3.9, we have the desired
result. □

Clearly we require the number of vertices to be bounded as $m \le \exp\{\frac{p}{9}\}$, so that the estimate of
the number of measurements is not vacuously true. This result has useful consequences in settings
in which $\mathrm{conv}(\mathcal{A})$ is a combinatorial polytope, as such polytopes are often vertex-transitive. We have
the following example on the number of measurements required to recover permutation matrices:

Proposition 3.15. Let $x^* \in \mathbb{R}^{m \times m}$ be a permutation matrix, and let $\mathcal{A}$ be the set of all $m \times m$
permutation matrices. Then $9\,m \log(m)$ random Gaussian measurements suffice, with high probability,
to recover $x^*$ by solving the optimization problem (5), which minimizes the norm induced by
the Birkhoff polytope of doubly stochastic matrices.

Proof. This result follows from Corollary 3.14 by noting that there are $m!$ permutation matrices of
size $m \times m$, and that $\log(m!) \le m \log(m)$. □
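The step from $9\log(m!)$ in Corollary 3.14 to the simpler $9\,m\log(m)$ in Proposition 3.15, and the comparison with the ambient dimension $m^2$, can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the search for the smallest useful matrix size is a back-of-the-envelope calculation rather than a statement made in the paper.

```python
import math


def vertex_transitive_bound(m):
    """9 * log(m!): Corollary 3.14 applied to the m! vertices of the Birkhoff polytope."""
    return 9.0 * math.lgamma(m + 1)  # lgamma(m + 1) = log(m!)


def permutation_bound(m):
    """The simpler bound 9 * m * log(m) quoted in Proposition 3.15."""
    return 9.0 * m * math.log(m)


def smallest_useful_size(limit=1000):
    """Smallest m >= 2 for which 9 m log(m) falls below the ambient dimension m^2."""
    for m in range(2, limit + 1):
        if permutation_bound(m) < m * m:
            return m
    return None
```

Since $\log(m!) \le m\log(m)$, the first bound is never larger than the second. With the constant 9, the quoted bound only drops below $m^2$ once $m$ is a few dozen, which illustrates why sharp constants matter in this regime.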

4  Representability  and  Algebraic  Geometry  of  Atomic  Norms 

4.1  Role  of  Algebraic  Structure 

All  of  our  discussion  thus  far  has  focussed  on  arbitrary  atomic  sets  A.  As  seen  in  Section  2 
the  geometry  of  the  convex  hull  conv(A)  completely  determines  conditions  under  which  exact 
recovery  is  possible  using  the  convex  program  (5).  In  this  section  we  address  the  question  of 
computationally  representing  the  convex  hull  conv(A)  (or  equivalently  of  computing  the  atomic 
norm $\|\cdot\|_{\mathcal{A}}$). These issues are critical in order to be able to solve the convex optimization problem
(5). Although the convex hull $\mathrm{conv}(\mathcal{A})$ is a well-defined object, in general we may not even be
able to computationally represent it (for example, if $\mathcal{A}$ is a fractal). In order to obtain exact or
approximate representations (analogous to the cases of the $\ell_1$ norm and the nuclear norm) it is
important to impose some structure on the atomic set $\mathcal{A}$. We focus on cases in which the set $\mathcal{A}$ has
algebraic structure. Specifically let the ring of multivariate polynomials in $p$ variables be denoted
by $\mathbb{R}[x] = \mathbb{R}[x_1, \ldots, x_p]$. We then consider real algebraic varieties [7]:

Definition 4.1. A real algebraic variety $S \subseteq \mathbb{R}^p$ is the set of real solutions of a system of polynomial
equations:

\[ S = \{x : g_j(x) = 0, \ \forall j\}, \]

where $\{g_j\}$ is a finite collection of polynomials in $\mathbb{R}[x]$.
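Two of the atomic sets appearing earlier, sign-vectors and orthogonal matrices, can be written as real algebraic varieties in the sense of Definition 4.1: sign-vectors satisfy $x_i^2 - 1 = 0$ for each coordinate, and orthogonal matrices satisfy the entrywise equations of $M^T M - I = 0$. The sketch below encodes these polynomial systems and tests membership numerically. The representation of polynomials as Python callables, the flat row-major matrix layout, and the tolerance are assumptions made for illustration.

```python
def in_variety(x, polynomials, tol=1e-9):
    """True if every polynomial g_j vanishes at x, up to a numerical tolerance."""
    return all(abs(g(x)) <= tol for g in polynomials)


def sign_vector_equations(p):
    """x_i^2 - 1 = 0 for each coordinate; the real solutions are {-1, +1}^p."""
    return [lambda x, i=i: x[i] ** 2 - 1.0 for i in range(p)]


def orthogonality_equations(m):
    """Entries of M^T M - I = 0 for an m x m matrix stored row-major in a flat list."""
    equations = []
    for i in range(m):
        for j in range(i, m):
            target = 1.0 if i == j else 0.0
            equations.append(
                lambda x, i=i, j=j, target=target:
                    sum(x[k * m + i] * x[k * m + j] for k in range(m)) - target
            )
    return equations
```

For example, `in_variety([0.6, -0.8, 0.8, 0.6], orthogonality_equations(2))` tests a 2 by 2 rotation matrix against the three independent equations of $M^T M = I$.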

Indeed  all  of  the  atomic  sets  A  considered  in  this  paper  are  examples  of  algebraic  varieties. 
Algebraic  varieties  have  the  remarkable  property  that  (the  closure  of)  their  convex  hull  can  be 
arbitrarily  well-approximated  in  a  constructive  manner  as  (the  projection  of)  a  set  defined  by  linear 
matrix  inequality  constraints.  A  potential  complication  may  arise,  however,  if  these  semidefinite 
representations  are  intractable  to  compute  in  polynomial  time.  In  such  cases  it  is  possible  to 
approximate  the  convex  hulls  via  a  hierarchy  of  tractable  semidefinite  relaxations.  We  describe 
these  results  in  more  detail  in  Section  4.2.  Therefore  the  atomic  norm  minimization  problems 
such  as  (7)  arising  in  such  situations  can  be  solved  exactly  or  approximately  via  semidefinite 
programming. 
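As a small illustration of the claim that the atomic sets in this paper are algebraic varieties, the sketch below evaluates polynomial systems whose real zero sets are the permutation matrices and the orthogonal matrices. The function names (`permutation_residuals`, `orthogonal_residuals`, `on_variety`) and the tolerance are assumptions made for this example; they do not come from the paper.

```python
import numpy as np

def permutation_residuals(X):
    """Values of polynomials that vanish exactly on m x m permutation matrices."""
    X = np.asarray(X, dtype=float)
    return np.concatenate([
        (X * X - X).ravel(),   # X_ij^2 - X_ij = 0 forces every entry into {0, 1}
        X.sum(axis=1) - 1.0,   # each row sums to one
        X.sum(axis=0) - 1.0,   # each column sums to one
    ])

def orthogonal_residuals(X):
    """Entries of X^T X - I, which vanish exactly on orthogonal matrices."""
    X = np.asarray(X, dtype=float)
    return (X.T @ X - np.eye(X.shape[1])).ravel()

def on_variety(residuals, tol=1e-9):
    """True if every defining polynomial is (numerically) zero."""
    return bool(np.max(np.abs(residuals)) <= tol)

P = np.eye(3)[[2, 0, 1]]         # a 3 x 3 permutation matrix
D = np.full((3, 3), 1.0 / 3.0)   # doubly stochastic, but not a permutation
```

Here `D` lies in the convex hull of the permutation matrices (the Birkhoff polytope) but not on the variety itself, which is the distinction between the atomic set and conv(A).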

Algebraic  structure  also  plays  a  second  important  role  in  atomic  norm  minimization  problems. 
If an atomic norm $\|\cdot\|_{\mathcal{A}}$ is intractable to compute, we may approximate it via a more tractable norm
$\|\cdot\|_{\mathrm{app}}$. However not every approximation of the atomic norm is equally good for solving inverse
problems. As illustrated in Figure 2 we can construct approximations of the $\ell_1$ ball that are tight
in a metric sense, with $(1-\epsilon)\|\cdot\|_{\mathrm{app}} \le \|\cdot\|_{\ell_1} \le (1+\epsilon)\|\cdot\|_{\mathrm{app}}$, but where the tangent cones at sparse
vectors in the new norm are halfspaces. In such a case, the number of measurements required to
recover the sparse vector ends up being on the same order as the ambient dimension. (Note that
the $\ell_1$-norm is in fact tractable to compute; we simply use it here for illustrative purposes.) The
key property that we seek in approximations to an atomic norm $\|\cdot\|_{\mathcal{A}}$ is that they preserve algebraic
structure such as the vertices/extreme points and more generally the low-dimensional faces of
conv(A). As discussed in Section 2.5 points on such low-dimensional faces correspond to simple
models, and algebraic-structure preserving approximations ensure that the tangent cones at simple
models with respect to the approximations are not too much larger than the corresponding tangent
cones with respect to the original atomic norms.

Figure 2: The convex body given by the dotted line is a good metric approximation to the $\ell_1$ ball.
However as its “corners” are “smoothed out”, the tangent cone at $x^\star$ goes from being a proper
cone (with respect to the $\ell_1$ ball) to a halfspace (with respect to the approximation).

4.2  Semidefinite  Relaxations  using  Theta  Bodies 

In  this  section  we  give  a  family  of  semidefinite  relaxations  to  the  atomic  norm  minimization  problem 
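whenever the atomic set has algebraic structure. To begin with, if we approximate the atomic norm
$\|\cdot\|_{\mathcal{A}}$ by another atomic norm $\|\cdot\|_{\tilde{\mathcal{A}}}$ defined using a larger collection of atoms $\mathcal{A} \subseteq \tilde{\mathcal{A}}$, it is clear
that

$\|\cdot\|_{\tilde{\mathcal{A}}} \le \|\cdot\|_{\mathcal{A}}.$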

Consequently  outer  approximations  of  the  atomic  set  give  rise  to  approximate  norms  that  provide 
lower  bounds  on  the  optimal  value  of  the  problem  (5). 
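The lower-bound property can be read off from the gauge definition of the atomic norm. The short derivation below is a sketch under the assumption that the atomic norm is the gauge $\|x\|_{\mathcal{A}} = \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}$, as defined earlier in the paper, with $\tilde{\mathcal{A}}$ written for the larger atomic set.

```latex
\mathcal{A} \subseteq \tilde{\mathcal{A}}
\;\Longrightarrow\;
\operatorname{conv}(\mathcal{A}) \subseteq \operatorname{conv}(\tilde{\mathcal{A}})
\;\Longrightarrow\;
\{t > 0 : x \in t\,\operatorname{conv}(\mathcal{A})\}
\subseteq
\{t > 0 : x \in t\,\operatorname{conv}(\tilde{\mathcal{A}})\}
\;\Longrightarrow\;
\|x\|_{\tilde{\mathcal{A}}} \le \|x\|_{\mathcal{A}} .
```

For instance, the signed standard basis vectors are contained in the unit Euclidean sphere, and the corresponding norms satisfy $\|x\|_2 \le \|x\|_1$. Since the inequality holds at every feasible point of (5), it also holds for the optimal values.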

In  order  to  provide  such  lower  bounds  on  the  optimal  value  of  (5),  we  discuss  semidefinite 
relaxations  of  the  convex  hull  conv(A).  All  our  discussion  here  is  based  on  results  described  in  [32] 
for  semidefinite  relaxations  of  convex  hulls  of  algebraic  varieties  using  theta  bodies.  We  only  give  a 
brief  review  of  the  relevant  constructions,  and  refer  the  reader  to  the  vast  literature  on  this  subject 
for  more  details  (see  [32,  50]  and  the  references  therein). 

To begin with we note that a sum-of-squares (SOS) polynomial in $\mathbb{R}[x]$ is a polynomial that can
be written as the (finite) sum of squares of other polynomials in $\mathbb{R}[x]$. Verifying the nonnegativity
of  a  multivariate  polynomial  is  intractable  in  general,  and  therefore  SOS  polynomials  play  an 
important  role  in  real  algebraic  geometry  as  an  SOS  polynomial  is  easily  seen  to  be  nonnegative 
everywhere.  Further  checking  whether  a  polynomial  is  an  SOS  polynomial  can  be  accomplished 
efficiently  via  semidefinite  programming  [50]. 
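The link between SOS polynomials and semidefinite programming is the Gram matrix: a univariate polynomial $p$ of degree $2d$ is SOS exactly when $p(x) = z^T Q z$ for some positive semidefinite $Q$, where $z = (1, x, \ldots, x^d)$. The sketch below only verifies a given certificate $Q$; searching for one is the semidefinite program and would need an SDP solver, which is not shown. The function names and the example polynomial are assumptions made for illustration.

```python
import numpy as np

def gram_to_coeffs(Q):
    """Coefficients (lowest degree first) of p(x) = z^T Q z, z = [1, x, ..., x^d]."""
    Q = np.asarray(Q, dtype=float)
    d = Q.shape[0]
    coeffs = np.zeros(2 * d - 1)
    for i in range(d):
        for j in range(d):
            coeffs[i + j] += Q[i, j]   # the monomial x^i * x^j has degree i + j
    return coeffs

def is_sos_certificate(Q, target_coeffs, tol=1e-9):
    """True if Q is a symmetric PSD Gram matrix reproducing the target polynomial."""
    Q = np.asarray(Q, dtype=float)
    if not np.allclose(Q, Q.T, atol=tol):
        return False
    psd = np.linalg.eigvalsh(Q).min() >= -tol
    matches = np.allclose(gram_to_coeffs(Q), target_coeffs, atol=tol)
    return bool(psd and matches)

# p(x) = x^4 - 2 x^2 + 1 = (x^2 - 1)^2, coefficients listed from degree 0 to 4.
p = [1.0, 0.0, -2.0, 0.0, 1.0]
Q_good = np.array([[1.0, 0.0, -1.0], [0.0, 0.0, 0.0], [-1.0, 0.0, 1.0]])
Q_bad = np.array([[1.0, 0.0, 0.0], [0.0, -2.0, 0.0], [0.0, 0.0, 1.0]])
```

Both matrices reproduce the coefficients of $p$, but only `Q_good` is positive semidefinite, so only it certifies that $p$ is SOS.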

Turning our attention to the description of the convex hull of an algebraic variety, we will
assume for the sake of simplicity that the convex hull is closed. Let $I \subseteq \mathbb{R}[x]$ be a polynomial ideal,
and let $V_{\mathbb{R}}(I) \subseteq \mathbb{R}^p$ be its real algebraic variety:

$V_{\mathbb{R}}(I) = \{x : f(x) = 0, \ \forall f \in I\}.$

One can then show that the convex hull $\mathrm{conv}(V_{\mathbb{R}}(I))$ is given as:

$\mathrm{conv}(V_{\mathbb{R}}(I)) = \{x : f(x) \ge 0, \ \forall f \text{ linear and nonnegative on } V_{\mathbb{R}}(I)\}$

$= \{x : f(x) \ge 0, \ \forall f \text{ linear s.t. } f = h + g \text{ for some nonnegative } h \text{ and some } g \in I\}$

$= \{x : f(x) \ge 0, \ \forall f \text{ linear s.t. } f \text{ nonnegative modulo } I\}.$

A linear polynomial here is one that has a maximum degree of one, and the meaning of “modulo
an ideal” is clear. As nonnegativity modulo an ideal may be intractable to check, we can consider
a relaxation to a polynomial being SOS modulo an ideal, i.e., a polynomial that can be written
as $\sum_{i=1}^{t} h_i^2 + g$ for polynomials $h_i$ and some $g$ in the ideal. Since it is tractable to check via
semidefinite programming whether bounded-degree polynomials are SOS, the $k$-th theta body of an
ideal $I$ is defined as follows in [32]:

$\mathrm{TH}_k(I) = \{x : f(x) \ge 0, \ \forall f \text{ linear s.t. } f \text{ is } k\text{-sos modulo } I\}.$

Here $k$-sos refers to an SOS polynomial in which the components in the SOS decomposition have
degree at most $k$. The $k$-th theta body $\mathrm{TH}_k(I)$ is a convex relaxation of $\mathrm{conv}(V_{\mathbb{R}}(I))$, and one can
verify that

$\mathrm{conv}(V_{\mathbb{R}}(I)) \subseteq \cdots \subseteq \mathrm{TH}_{k+1}(I) \subseteq \mathrm{TH}_k(I).$

By the arguments given above (see also [32]) these theta bodies can be described using semidefinite
programs of size polynomial in $k$. Hence by considering theta bodies $\mathrm{TH}_k(I)$ with increasingly
larger $k$, one can obtain a hierarchy of tighter semidefinite relaxations of $\mathrm{conv}(V_{\mathbb{R}}(I))$. We also
note that in many cases of interest such semidefinite relaxations preserve low-dimensional faces of
the convex hull of a variety, although these properties are not known in general.

Approximating tensor norms. We conclude this section with an example application of
these relaxations to the problem of approximating the tensor nuclear norm. We focus on the case
of tensors of order three that lie in $\mathbb{R}^{m \times m \times m}$, i.e., tensors indexed by three numbers, for notational
simplicity, although our discussion is applicable more generally. In particular the atomic set $\mathcal{A}$ is
the set of unit-Euclidean-norm rank-one tensors:

$\mathcal{A} = \{u \otimes v \otimes w : u, v, w \in \mathbb{R}^m, \ \|u\| = \|v\| = \|w\| = 1\}$

$= \{N \in \mathbb{R}^{m^3} : N = u \otimes v \otimes w, \ u, v, w \in \mathbb{R}^m, \ \|u\| = \|v\| = \|w\| = 1\},$

where $u \otimes v \otimes w$ is the tensor product of three vectors. Note that the second description is written
as the projection onto $\mathbb{R}^{m^3}$ of a variety defined in $\mathbb{R}^{m^3+3m}$. The nuclear norm is then given by (2),
and is intractable to compute in general. Now let $I_{\mathcal{A}}$ denote a polynomial ideal of polynomial maps
from $\mathbb{R}^{m^3+3m}$ to $\mathbb{R}$:

$I_{\mathcal{A}} = \Big\{ g : g = \sum_{i,j,k=1}^{m} g_{ijk}(N_{ijk} - u_i v_j w_k) + g_u(u^T u - 1) + g_v(v^T v - 1) + g_w(w^T w - 1), \ \forall g_{ijk}, g_u, g_v, g_w \Big\}.$

Here $g_u, g_v, g_w, \{g_{ijk}\}_{i,j,k}$ are polynomials in the variables $N, u, v, w$. Following the program
described above for constructing approximations, a family of semidefinite relaxations to the tensor
nuclear norm ball can be prescribed in this manner via the theta bodies $\mathrm{TH}_k(I_{\mathcal{A}})$.
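To make the lifted description concrete, the sketch below builds one atom $u \otimes v \otimes w$ and evaluates the generators of the ideal at the point $(N, u, v, w)$; they should all vanish. The helper names and the choice $m = 4$ are assumptions made for this example only.

```python
import numpy as np

def unit(x):
    """Rescale a vector to unit Euclidean norm."""
    return x / np.linalg.norm(x)

def rank_one_atom(u, v, w):
    """The tensor u (x) v (x) w with entries N[i, j, k] = u[i] * v[j] * w[k]."""
    return np.einsum("i,j,k->ijk", u, v, w)

def ideal_generators(N, u, v, w):
    """Values of the generating polynomials at the point (N, u, v, w)."""
    return np.concatenate([
        (N - rank_one_atom(u, v, w)).ravel(),      # N_ijk - u_i v_j w_k
        [u @ u - 1.0, v @ v - 1.0, w @ w - 1.0],   # unit-norm constraints
    ])

rng = np.random.default_rng(0)
m = 4
u, v, w = (unit(rng.standard_normal(m)) for _ in range(3))
N = rank_one_atom(u, v, w)
residual = np.max(np.abs(ideal_generators(N, u, v, w)))
```

The point $(N, u, v, w)$ has $m^3 + 3m$ coordinates, and dropping $(u, v, w)$ is the projection onto the tensor itself.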

4.3 Tradeoff between Relaxation and Number of Measurements

As  discussed  in  Section  2.5  the  atomic  norm  is  the  best  convex  heuristic  for  solving  ill-posed  linear 
inverse  problems  of  the  type  considered  in  this  paper.  However  we  may  wish  to  approximate  the 
atomic  norm  in  cases  when  it  is  intractable  to  compute  exactly,  and  the  discussion  in  the  preceding 
section  provides  one  approach  to  constructing  a  family  of  relaxations.  As  one  might  expect  the 
tradeoff  for  using  such  approximations,  i.e.,  a  weaker  convex  heuristic  than  the  atomic  norm,  is  an 
increase  in  the  number  of  measurements  required  for  exact  or  robust  recovery.  The  reason  for  this 
is  that  the  approximate  norms  have  larger  tangent  cones  at  their  extreme  points,  which  makes  it 
harder  to  satisfy  the  empty  intersection  condition  of  Proposition  2.1.  We  highlight  this  tradeoff 
here  with  an  illustrative  example  involving  the  cut  polytope. 

The cut polytope is defined as the convex hull of all cut matrices:

$\mathcal{P} = \mathrm{conv}\{zz^T : z \in \{-1,+1\}^m\}.$

As described in Section 2.2 low-rank matrices that are composed of ±1's as entries are of interest in
collaborative filtering [58], and the norm induced by the cut polytope is a potential convex heuristic
for recovering such matrices from limited measurements. However it is well-known that the cut
polytope is intractable to characterize [20], and therefore we need to use tractable relaxations
instead. We consider the following two relaxations of the cut polytope. The first is the popular
relaxation that is used in semidefinite approximations of the MAXCUT problem:

$\mathcal{P}_1 = \{M : M \text{ symmetric}, \ M \succeq 0, \ M_{ii} = 1, \ \forall i = 1, \ldots, m\}.$

This is the well-studied elliptope [20], and can also be interpreted as the second theta body
relaxation (see Section 4.2) of the cut polytope $\mathcal{P}$ [32]. We also investigate the performance of a second,
weaker relaxation:

$\mathcal{P}_2 = \{M : M \text{ symmetric}, \ M_{ii} = 1, \ \forall i, \ |M_{ij}| \le 1, \ \forall i \ne j\}.$

This polytope is simply the convex hull of symmetric matrices with ±1's in the off-diagonal entries,
and 1's on the diagonal. We note that $\mathcal{P}_2$ is an extremely weak relaxation of $\mathcal{P}$, but we use it here
only for illustrative purposes. It is easily seen that

$\mathcal{P} \subset \mathcal{P}_1 \subset \mathcal{P}_2,$

with all the inclusions being strict. Figure 3 gives a toy sketch that highlights all the main geometric
aspects of these relaxations. In particular $\mathcal{P}_1$ has many more extreme points than $\mathcal{P}$, although the
vertices of $\mathcal{P}_1$, i.e., points that have full-dimensional normal cones, are precisely the cut
matrices (which are the vertices of $\mathcal{P}$) [20]. The convex polytope $\mathcal{P}_2$ contains many more vertices
compared to $\mathcal{P}$ as shown in Figure 3. As expected the tangent cones at vertices of $\mathcal{P}$ become
increasingly larger as we use successively weaker relaxations. The following result summarizes the
number of random measurements required for recovering a cut matrix, i.e., a rank-one sign matrix,
using the norms induced by each of these convex bodies.

Proposition 4.2. Suppose $x^\star \in \mathbb{R}^{m \times m}$ is a rank-one sign matrix, i.e., a cut matrix, and we are
given $n$ random Gaussian measurements of $x^\star$. We wish to recover $x^\star$ by solving a convex program
based on the norms induced by each of $\mathcal{P}, \mathcal{P}_1, \mathcal{P}_2$. We have exact recovery of $x^\star$ in each of these
cases with high probability under the following conditions on the number of measurements:

1. Using $\mathcal{P}$: $n = O(m)$.

2. Using $\mathcal{P}_1$: $n = O(m)$.

3. Using $\mathcal{P}_2$: $n = \frac{m^2 - m}{4}$.

Figure 3: A toy sketch illustrating the cut polytope $\mathcal{P}$, and the two approximations $\mathcal{P}_1$ and $\mathcal{P}_2$.
Note that $\mathcal{P}_1$ is a sketch of the standard semidefinite relaxation that has the same vertices as $\mathcal{P}$.
On the other hand $\mathcal{P}_2$ is a polyhedral approximation to $\mathcal{P}$ that has many more vertices as shown
in this sketch.

Proof. For the first part, we note that $\mathcal{P}$ is a symmetric polytope with $2^{m-1}$ vertices. Therefore
we can apply Corollary 3.14 to conclude that $n = O(m)$ measurements suffice for exact recovery.
For the second part we note that the tangent cone at $x^\star$ with respect to the nuclear norm ball
of $m \times m$ matrices contains within it the tangent cone at $x^\star$ with respect to $\mathcal{P}_1$. Hence
we appeal to Proposition 3.11 to conclude that $n = O(m)$ measurements suffice for exact recovery.
Finally, we note that $\mathcal{P}_2$ is essentially the hypercube in $\binom{m}{2}$ dimensions. Appealing to
Proposition 3.12, we conclude that $n = \frac{m^2 - m}{4}$ measurements suffice for exact recovery. □
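The inclusions between the cut polytope and its two relaxations can be checked numerically on small examples. The sketch below tests membership in the elliptope and in the box relaxation; the function names `in_P1` and `in_P2` and the tolerance are assumptions made for this example. Membership in the cut polytope itself is not tested, since that is the intractable part.

```python
import numpy as np

def in_P2(M, tol=1e-9):
    """Box relaxation: symmetric, unit diagonal, all entries in [-1, 1]."""
    M = np.asarray(M, dtype=float)
    return bool(np.allclose(M, M.T, atol=tol)
                and np.allclose(np.diag(M), 1.0, atol=tol)
                and np.all(np.abs(M) <= 1.0 + tol))

def in_P1(M, tol=1e-9):
    """Elliptope: symmetric, unit diagonal, positive semidefinite."""
    M = np.asarray(M, dtype=float)
    if not (np.allclose(M, M.T, atol=tol) and np.allclose(np.diag(M), 1.0, atol=tol)):
        return False
    return bool(np.linalg.eigvalsh(M).min() >= -tol)

z = np.array([1.0, -1.0, 1.0, 1.0])
cut = np.outer(z, z)   # a cut matrix, i.e., a vertex of the cut polytope

# A symmetric sign pattern with eigenvalue -1 (eigenvector (1, -1, 1)),
# so it is a vertex of the box relaxation that lies outside the elliptope.
M = np.array([[1.0, 1.0, -1.0],
              [1.0, 1.0, 1.0],
              [-1.0, 1.0, 1.0]])
```

The matrix `M` is one of the extra vertices that the weaker relaxation introduces, which is the geometric reason its tangent cones at cut matrices are larger.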

It  is  not  too  hard  to  show  that  these  bounds  are  order-optimal,  and  that  they  cannot  be 
improved.  Thus  we  have  a  rigorous  demonstration  in  this  particular  instance  of  the  fact  that  the 
number  of  measurements  required  for  exact  recovery  increases  as  the  relaxations  get  weaker  (and 
as  the  tangent  cones  get  larger).  The  principle  underlying  this  illustration  holds  more  generally, 
namely  that  there  exists  a  tradeoff  between  the  complexity  of  the  convex  heuristic  and  the  number 
of  measurements  required  for  exact  or  robust  recovery.  It  would  be  of  interest  to  quantify  this 
tradeoff  in  other  settings,  for  example,  in  problems  in  which  we  use  increasingly  tighter  relaxations 
of  the  atomic  norm  via  theta  bodies. 

We also note that the tractable relaxation based on $\mathcal{P}_1$ is only off by a constant factor with
respect to the optimal heuristic based on the cut polytope $\mathcal{P}$. This suggests the potential for tractable
heuristics to approximate hard atomic norms with provable approximation ratios, akin to
methods developed in the literature on approximation algorithms for hard combinatorial optimization
problems.

4.4  Terracini’s  Lemma  and  Lower  Bounds  on  Recovery 

Algebraic structure in the atomic set $\mathcal{A}$ provides yet another interesting insight, namely for giving
lower bounds on the number of measurements required for exact recovery. The recovery condition
of Proposition 2.1 states that the nullspace $\mathrm{null}(\Phi)$ of the measurement operator $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ must
miss the tangent cone $T_{\mathcal{A}}(x^\star)$ at the point of interest $x^\star$. Suppose that this tangent cone contains
a $q$-dimensional subspace. It is then clear from straightforward linear algebra arguments that the
number of measurements $n$ must be at least $q$. Indeed this bound must hold for any linear measurement
scheme. Thus the dimension of the subspace contained inside the tangent cone provides a simple
lower bound on the number of linear measurements.
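The linear algebra behind this lower bound is a dimension count: if the subspace has dimension $q$ and there are only $n < q$ measurements, then the measurement operator restricted to the subspace is an $n \times q$ matrix, which must have a nontrivial null space. The sketch below constructs such a direction explicitly. The function name, the dimensions, and the random test data are assumptions made for this example.

```python
import numpy as np

def direction_in_nullspace(Phi, S):
    """Unit vector in span(S) that Phi maps to zero, assuming n < q.

    Phi is n x p, and S is p x q with orthonormal columns.
    """
    n, q = Phi.shape[0], S.shape[1]
    if n >= q:
        raise ValueError("need fewer measurements than the subspace dimension")
    # Phi @ S is n x q with n < q, so its rank is at most n; the last right
    # singular vector of the full SVD therefore lies in its null space.
    _, _, Vt = np.linalg.svd(Phi @ S, full_matrices=True)
    return S @ Vt[-1]

rng = np.random.default_rng(1)
p, q, n = 12, 5, 4
S, _ = np.linalg.qr(rng.standard_normal((p, q)))   # orthonormal basis of a q-dim subspace
Phi = rng.standard_normal((n, p))
d = direction_in_nullspace(Phi, S)
```

The vector `d` is a nonzero element of the subspace that the measurements cannot see, so a tangent cone containing that subspace cannot be missed by the nullspace.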

In  this  section  we  discuss  a  method  to  obtain  estimates  of  the  dimension  of  a  subspace  component 
of  the  tangent  cone.  We  focus  again  on  the  setting  in  which  A  is  an  algebraic  variety.  Indeed  in 
all  of  the  examples  of  Section  2.2,  the  atomic  set  A  is  an  algebraic  variety.  In  such  cases  simple 
models  x*  formed  according  to  (1)  can  be  viewed  as  elements  of  secant  varieties. 

Definition 4.3. Let $\mathcal{A} \subseteq \mathbb{R}^p$ be an algebraic variety. Then the $k$'th secant variety $\mathcal{A}^k$ is defined as
the union of all affine spaces passing through any $k + 1$ points of $\mathcal{A}$.

Algebraic geometry has a long history of investigations of secant varieties, as well as tangent
spaces to these secant varieties [34]. In particular a question of interest is to characterize the
dimensions of secant varieties and tangent spaces. In our context, estimates of these dimensions
are useful in giving lower bounds on the number of measurements required for recovery. Specifically
we have the following result, which states that certain linear spaces must lie in the tangent cone at
$x^\star$ with respect to $\mathrm{conv}(\mathcal{A})$:

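Proposition 4.4. Let $\mathcal{A} \subseteq \mathbb{R}^p$ be a smooth variety, and let $T(u, \mathcal{A})$ denote the tangent space at
any $u \in \mathcal{A}$ with respect to $\mathcal{A}$. Suppose $x^\star = \sum_{i=1}^{k} c_i a_i$, $\forall a_i \in \mathcal{A}$, $c_i > 0$, such that

$\|x^\star\|_{\mathcal{A}} = \sum_{i=1}^{k} c_i.$

Then the tangent cone $T_{\mathcal{A}}(x^\star)$ contains the following linear space:

$T(a_1, \mathcal{A}) \oplus \cdots \oplus T(a_k, \mathcal{A}) \subseteq T_{\mathcal{A}}(x^\star),$

where $\oplus$ denotes the direct sum of subspaces.

Proof. We note that if we perturb $a_1$ slightly to any neighboring $a_1'$ so that $a_1' \in \mathcal{A}$, then the
resulting $x' = c_1 a_1' + \sum_{i=2}^{k} c_i a_i$ is such that $\|x'\|_{\mathcal{A}} \le \|x^\star\|_{\mathcal{A}}$. The proposition follows directly from
this observation. □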

By Terracini's lemma [34] from algebraic geometry the subspace $T(a_1, \mathcal{A}) \oplus \cdots \oplus T(a_k, \mathcal{A})$ is
in fact the estimate for the tangent space $T(x, \mathcal{A}^{k-1})$ at $x$ with respect to the $(k-1)$'th secant
variety $\mathcal{A}^{k-1}$:

Proposition 4.5 (Terracini's Lemma). Let $\mathcal{A} \subseteq \mathbb{R}^p$ be a smooth affine variety, and let $T(u, \mathcal{A})$
denote the tangent space at any $u \in \mathcal{A}$ with respect to $\mathcal{A}$. Suppose $x \in \mathcal{A}^{k-1}$ is a generic point such
that $x = \sum_{i=1}^{k} c_i a_i$, $\forall a_i \in \mathcal{A}$, $c_i > 0$. Then the tangent space $T(x, \mathcal{A}^{k-1})$ at $x$ with respect to the
secant variety $\mathcal{A}^{k-1}$ is given by $T(a_1, \mathcal{A}) \oplus \cdots \oplus T(a_k, \mathcal{A})$. Moreover the dimension of $T(x, \mathcal{A}^{k-1})$
is at most (and is expected to be) $\min\{p, (k+1)\dim(\mathcal{A}) + k\}$.

Combining these results we have that estimates of the dimension of the tangent space $T(x, \mathcal{A}^{k-1})$
lead directly to lower bounds on the number of measurements required for recovery. The intuition
here is clear as the number of measurements required must be bounded below by the number of
“degrees of freedom,” which is captured by the dimension of the tangent space $T(x, \mathcal{A}^{k-1})$. However
Terracini's lemma provides us with general estimates of the dimension of $T(x, \mathcal{A}^{k-1})$ for generic
points $x$. Therefore we can directly obtain lower bounds on the number of measurements, purely
by considering the dimension of the variety $\mathcal{A}$ and the number of elements from $\mathcal{A}$ used to construct
$x$ (i.e., the order of the secant variety in which $x$ lies). As an example the dimension of the base
variety of normalized order-three tensors in $\mathbb{R}^{m \times m \times m}$ is $3(m - 1)$. Consequently if we were to in
principle solve the tensor nuclear norm minimization problem, we should expect to require at least
$O(km)$ measurements to recover a rank-$k$ tensor.
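The dimension count $3(m - 1)$ for unit-norm rank-one tensors can be checked numerically. The sketch below spans the tangent space at a random atom $u \otimes v \otimes w$ by perturbing each factor in directions orthogonal to itself (so the perturbed factor stays on the unit sphere to first order) and computes the rank of the resulting set of directions. The function names and the random test point are assumptions made for this example.

```python
import numpy as np

def orth_complement(u):
    """Orthonormal basis (as columns) of the subspace orthogonal to the vector u."""
    # In the full SVD of u viewed as an m x 1 matrix, the first left singular
    # vector is parallel to u and the remaining m - 1 span its complement.
    U, _, _ = np.linalg.svd(u.reshape(-1, 1), full_matrices=True)
    return U[:, 1:]

def tangent_space_dim(m, seed=0):
    """Dimension of the tangent space to the set of atoms at a random atom."""
    rng = np.random.default_rng(seed)
    u, v, w = (x / np.linalg.norm(x) for x in rng.standard_normal((3, m)))
    directions = []
    for du in orth_complement(u).T:
        directions.append(np.einsum("i,j,k->ijk", du, v, w).ravel())
    for dv in orth_complement(v).T:
        directions.append(np.einsum("i,j,k->ijk", u, dv, w).ravel())
    for dw in orth_complement(w).T:
        directions.append(np.einsum("i,j,k->ijk", u, v, dw).ravel())
    return int(np.linalg.matrix_rank(np.array(directions)))
```

Calling `tangent_space_dim(m)` for a few small values of `m` gives a quick sanity check of the stated dimension before plugging it into the Terracini estimate.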


5  Computational  Experiments 

5.1  Algorithmic  Considerations 

While a variety of atomic norms can be represented or approximated by linear matrix inequalities,
these representations do not necessarily translate into practical implementations. Semidefinite
programming can be technically solved in polynomial time, but general interior point solvers typically
only scale to problems with a few hundred variables. For larger scale problems, it is often preferable
to exploit structure in the atomic set $\mathcal{A}$ to develop fast, first-order algorithms.

A starting point for first-order algorithm design lies in determining the structure of the proximity
operator (or Moreau envelope) associated with the atomic norm,

$\Pi_{\mathcal{A}}(x; \mu) := \arg\min_z \ \tfrac{1}{2}\|z - x\|^2 + \mu\|z\|_{\mathcal{A}}.$ (16)

Here $\mu$ is some positive parameter. Proximity operators have already been harnessed for fast
algorithms involving the $\ell_1$ norm [28, 15, 16, 33, 62] and the nuclear norm [44, 10, 61] where these
maps can be quickly computed in closed form. For the $\ell_1$ norm, the $i$th component of $\Pi_{\mathcal{A}}(x; \mu)$ is
given by

$\Pi_{\mathcal{A}}(x; \mu)_i = \begin{cases} x_i + \mu & x_i < -\mu \\ 0 & -\mu \le x_i \le \mu \\ x_i - \mu & x_i > \mu \end{cases}$ (17)

This is the so-called soft thresholding operator. For the nuclear norm, $\Pi_{\mathcal{A}}$ soft thresholds the
singular values. In either case, the only structure necessary for the cited algorithms to converge
is the convexity of the norm. Indeed, essentially any algorithm developed for $\ell_1$ or nuclear norm
minimization can in principle be adapted for atomic norm minimization. One simply needs to apply
the operator $\Pi_{\mathcal{A}}$ wherever a shrinkage operation was previously applied.
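The two closed-form proximity operators mentioned above can be written in a few lines. This is a minimal sketch; the function names `soft_threshold` and `singular_value_threshold` are chosen for this example and are not taken from the cited algorithms.

```python
import numpy as np

def soft_threshold(x, mu):
    """Proximity operator of mu * ||.||_1, applied componentwise as in (17)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def singular_value_threshold(X, mu):
    """Proximity operator of mu * (nuclear norm): soft threshold the singular values."""
    U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    # Scaling the columns of U by the thresholded singular values rebuilds the matrix.
    return (U * soft_threshold(s, mu)) @ Vt

x_shrunk = soft_threshold([-3.0, -0.5, 0.0, 0.5, 3.0], 1.0)
X_shrunk = singular_value_threshold(np.diag([3.0, 1.0]), 2.0)
```

In the first call the entries with magnitude at most 1 are set to zero and the others move toward zero by 1; in the second, the singular values 3 and 1 become 1 and 0.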

For a concrete example, suppose $f$ is a smooth function, and consider the optimization problem

$\min_x \ f(x) + \mu\|x\|_{\mathcal{A}}.$ (18)

The classical projected gradient method for this problem alternates between taking steps along the
negative gradient of $f$ and then applying the proximity operator associated with the atomic norm. Explicitly,
the algorithm consists of the iterative procedure

$x_{k+1} = \Pi_{\mathcal{A}}(x_k - \alpha_k \nabla f(x_k); \ \alpha_k \mu)$ (19)

where $\{\alpha_k\}$ is a sequence of positive stepsizes. Under very mild assumptions, this iteration can be
shown to converge to a stationary point of (18) [29]. When $f$ is convex, the returned stationary
point is a globally optimal solution. Recently, Nesterov has described a particular variant of this
algorithm that is guaranteed to converge at a rate no worse than $O(k^{-1})$, where $k$ is the iteration
counter [49]. Moreover, he proposes simple enhancements of the standard iteration to achieve an
$O(k^{-2})$ convergence rate for convex $f$ and a linear rate of convergence for strongly convex $f$.

If we apply the projected gradient method to the regularized inverse problem

$\min_x \ \|\Phi x - y\|^2 + \lambda\|x\|_{\mathcal{A}}$ (20)

then the algorithm reduces to the straightforward iteration

$x_{k+1} = \Pi_{\mathcal{A}}(x_k + \alpha_k \Phi^\dagger (y - \Phi x_k); \ \alpha_k \lambda).$ (21)

Here (20) is equivalent to (7) for an appropriately chosen $\lambda > 0$ and is useful for estimation from
noisy measurements.
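As an illustration of iteration (21), the sketch below specializes it to the $\ell_1$ norm, where the proximity operator is soft thresholding. Two choices here are assumptions made for the example rather than prescriptions from the text: the data term is scaled as $\tfrac{1}{2}\|\Phi x - y\|^2$ so that the gradient step is exactly $\Phi^T (y - \Phi x)$, and the stepsize is held constant at the reciprocal of the squared spectral norm of $\Phi$.

```python
import numpy as np

def soft_threshold(x, mu):
    """Proximity operator of mu * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def prox_gradient_l1(Phi, y, lam, num_iters=500, x0=None):
    """Iteration (21) for the l1 norm with a constant stepsize.

    Minimizes 0.5 * ||Phi x - y||^2 + lam * ||x||_1 (the 0.5 is an assumption
    of this sketch, chosen so the gradient step is Phi^T (y - Phi x)).
    """
    Phi = np.asarray(Phi, dtype=float)
    y = np.asarray(y, dtype=float)
    x = np.zeros(Phi.shape[1]) if x0 is None else np.array(x0, dtype=float)
    alpha = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / (largest singular value)^2
    for _ in range(num_iters):
        x = soft_threshold(x + alpha * Phi.T @ (y - Phi @ x), alpha * lam)
    return x

def objective(Phi, y, lam, x):
    """Value of the regularized least-squares objective used above."""
    return 0.5 * np.sum((Phi @ x - y) ** 2) + lam * np.sum(np.abs(x))

# With Phi equal to the identity the minimizer is the soft-thresholded data.
x_id = prox_gradient_l1(np.eye(3), [3.0, 0.5, -2.0], lam=1.0)
```

For another atomic norm, only the call to `soft_threshold` would change, to the proximity operator of that norm.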

The basic (noiseless) atomic norm minimization problem (5) can be solved by minimizing a
sequence of instances of (20) with monotonically decreasing values of $\lambda$. Each subsequent
minimization is initialized from the point returned by the previous step. Such an approach corresponds
to the classic Method of Multipliers [5] and has proven effective for solving problems regularized
by the $\ell_1$ norm and for total variation denoising [63, 11].
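The warm-started sequence of regularized problems can be sketched as a simple outer loop. The code below is a continuation sketch for the $\ell_1$ norm only: the inner solver, the schedule of $\lambda$ values, and the iteration counts are all assumptions made for illustration, and it is not a full Method of Multipliers implementation.

```python
import numpy as np

def soft_threshold(x, mu):
    """Proximity operator of mu * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def solve_regularized(Phi, y, lam, x0, num_iters=200):
    """Approximately minimize 0.5 * ||Phi x - y||^2 + lam * ||x||_1 from x0."""
    x = np.array(x0, dtype=float)
    alpha = 1.0 / np.linalg.norm(Phi, 2) ** 2
    for _ in range(num_iters):
        x = soft_threshold(x + alpha * Phi.T @ (y - Phi @ x), alpha * lam)
    return x

def continuation(Phi, y, lambdas, num_iters=200):
    """Solve a sequence of regularized problems with decreasing lam, warm-started."""
    Phi = np.asarray(Phi, dtype=float)
    y = np.asarray(y, dtype=float)
    x = np.zeros(Phi.shape[1])
    for lam in sorted(lambdas, reverse=True):
        # Each solve starts from the point returned by the previous one.
        x = solve_regularized(Phi, y, lam, x, num_iters)
    return x

y_demo = np.array([3.0, 0.5, -2.0])
x_demo = continuation(np.eye(3), y_demo, lambdas=[1.0, 0.1, 0.01])
```

In this identity-matrix example each inner solve returns the data soft-thresholded at the current $\lambda$, so the iterates approach the data as $\lambda$ shrinks, mirroring how the regularized solutions approach a solution of the equality-constrained problem.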

This discussion demonstrates that when the proximity operator associated with some atomic
set $\mathcal{A}$ can be easily computed, then efficient first-order algorithms are immediate. For novel atomic
norm applications, one can thus focus on algorithms and techniques to compute the associated
proximity operators. We note that, from a computational perspective, it may be easier to compute the
proximity operator via the dual atomic norm. Associated to each proximity operator is the dual
operator

$\Lambda_{\mathcal{A}}(x; \mu) = \arg\min_y \ \tfrac{1}{2}\|y - x\|^2 \quad \text{s.t.} \quad \|y\|_{\mathcal{A}}^* \le \mu.$ (22)

By an appropriate change of variables, $\Lambda_{\mathcal{A}}$ is, up to a rescaling by $\mu$, nothing more than the
projection of $\mu^{-1}x$ onto the unit ball in the dual atomic norm:

$\mu^{-1}\Lambda_{\mathcal{A}}(x; \mu) = \arg\min_y \ \tfrac{1}{2}\|y - \mu^{-1}x\|^2 \quad \text{s.t.} \quad \|y\|_{\mathcal{A}}^* \le 1.$ (23)

From convex programming duality, we have $x = \Pi_{\mathcal{A}}(x; \mu) + \Lambda_{\mathcal{A}}(x; \mu)$. This can be seen by
observing

$\min_z \ \tfrac{1}{2}\|z - x\|^2 + \mu\|z\|_{\mathcal{A}} = \min_z \max_{\|y\|_{\mathcal{A}}^* \le \mu} \ \tfrac{1}{2}\|z - x\|^2 + \langle y, z \rangle$ (24)

$= \max_{\|y\|_{\mathcal{A}}^* \le \mu} \min_z \ \tfrac{1}{2}\|z - x\|^2 + \langle y, z \rangle$ (25)

$= \max_{\|y\|_{\mathcal{A}}^* \le \mu} \ -\tfrac{1}{2}\|y - x\|^2 + \tfrac{1}{2}\|x\|^2.$ (26)

In particular, $\Pi_{\mathcal{A}}(x; \mu)$ and $\Lambda_{\mathcal{A}}(x; \mu)$ form a complementary primal-dual pair for this optimization
problem. Hence, we only need to be able to efficiently compute the Euclidean projection onto the dual
norm ball to compute the proximity operator associated with the atomic norm.
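For the $\ell_1$ norm the dual norm is the $\ell_\infty$ norm, and the Euclidean projection onto its ball is a componentwise clip, so the primal-dual relation can be checked directly. The sketch below computes the proximity operator as the input minus the dual projection and compares it with soft thresholding; the function names are assumptions made for this example.

```python
import numpy as np

def project_linf_ball(x, mu):
    """Euclidean projection onto {y : ||y||_inf <= mu}, the dual ball of the l1 norm."""
    return np.clip(x, -mu, mu)

def prox_l1_via_dual(x, mu):
    """Proximity operator of mu * ||.||_1, computed as x minus the dual projection."""
    x = np.asarray(x, dtype=float)
    return x - project_linf_ball(x, mu)

def soft_threshold(x, mu):
    """Closed-form proximity operator of mu * ||.||_1, for comparison."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
mu = 1.0
primal = prox_l1_via_dual(x, mu)
dual = project_linf_ball(x, mu)
```

The same pattern applies to any atomic norm whose dual ball admits a cheap projection: compute the projection, then subtract it from the input.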

Finally, though the proximity operator provides an elegant framework for algorithm generation,
there are many other possible algorithmic approaches that may be employed to take advantage of
the particular structure of an atomic set $\mathcal{A}$. For instance, we can rewrite (22) as

$\mu^{-1}\Lambda_{\mathcal{A}}(x; \mu) = \arg\min_y \ \tfrac{1}{2}\|y - \mu^{-1}x\|^2 \quad \text{s.t.} \quad \langle y, a \rangle \le 1 \ \ \forall a \in \mathcal{A}.$ (27)

Suppose we have access to a procedure that, given $z \in \mathbb{R}^p$, can decide whether $\langle z, a \rangle \le 1$ for
all $a \in \mathcal{A}$, or can find a violated constraint where $\langle z, a \rangle > 1$. In this case, we can apply a
cutting plane method or ellipsoid method to solve (22) or (6) [48, 52]. Similarly, if it is simpler
to compute a subgradient of the atomic norm than it is to compute a proximity operator, then
the standard subgradient method [6, 48] can be applied to solve problems of the form (20). Each
computational scheme will have different advantages and drawbacks for specific atomic sets, and
relative effectiveness needs to be evaluated on a case-by-case basis.
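The separation procedure described above is easy to write down for the two best-known atomic sets. The sketch below returns `None` when every constraint $\langle z, a \rangle \le 1$ holds and otherwise returns a violating atom; the function names and the tolerance are assumptions made for this example, and the cutting plane or ellipsoid method that would call such an oracle is not shown.

```python
import numpy as np

def l1_atom_oracle(z, tol=1e-12):
    """Separation oracle for the atoms {+e_i, -e_i} of the l1 norm.

    Returns None if <z, a> <= 1 for every atom, else an atom with <z, a> > 1.
    """
    z = np.asarray(z, dtype=float)
    i = int(np.argmax(np.abs(z)))   # the maximum of <z, a> over atoms is ||z||_inf
    if abs(z[i]) <= 1.0 + tol:
        return None
    a = np.zeros_like(z)
    a[i] = np.sign(z[i])
    return a

def rank_one_atom_oracle(Z, tol=1e-12):
    """Same oracle for unit-norm rank-one matrices u v^T (nuclear norm atoms)."""
    Z = np.asarray(Z, dtype=float)
    U, s, Vt = np.linalg.svd(Z)
    if s[0] <= 1.0 + tol:           # the maximum of <Z, a> is the spectral norm
        return None
    return np.outer(U[:, 0], Vt[0])

z = np.array([0.5, -2.0])
a = l1_atom_oracle(z)
Z = np.diag([3.0, 0.5])
A = rank_one_atom_oracle(Z)
```

In both cases the oracle costs no more than evaluating the dual norm, which is why it is a natural primitive when the dual norm is cheap but the proximity operator is not.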

Figure 4: Plots of the number of measurements available versus the probability of exact recovery
(computed over 50 trials) for various models.


5.2  Simulation  Results 

We describe the results of numerical experiments in recovering orthogonal matrices, permutation
matrices, and rank-one sign matrices (i.e., cut matrices) from random linear measurements by
solving convex optimization problems. All the atomic norm minimization problems in these experiments
are solved using a combination of the SDPT3 package [60] and the YALMIP parser [43].

Orthogonal  matrices.  We  consider  the  recovery  of  20  x  20  orthogonal  matrices  from  random 
Gaussian  measurements  via  spectral  norm  minimization.  Specifically  we  solve  the  convex  program 
(5),  with  the  atomic  norm  being  the  spectral  norm.  Figure  4  gives  a  plot  of  the  probability  of  exact 
recovery  (computed  over  50  random  trials)  versus  the  number  of  measurements  required. 

Permutation  matrices.  We  consider  the  recovery  of  20  x  20  permutation  matrices  from 
random  Gaussian  measurements.  We  solve  the  convex  program  (5),  with  the  atomic  norm  being 
the  norm  induced  by  the  Birkhoff  polytope  of  20  x  20  doubly  stochastic  matrices.  Figure  4  gives 
a  plot  of  the  probability  of  exact  recovery  (computed  over  50  random  trials)  versus  the  number  of 
measurements  required. 

Cut matrices. We consider the recovery of 20 × 20 cut matrices from random Gaussian
measurements. As the cut polytope is intractable to characterize, we solve the convex program (5)
with the atomic norm being approximated by the norm induced by the semidefinite relaxation $\mathcal{P}_1$
described in Section 4.3. Figure 4 gives a plot of the probability of exact recovery (computed over
50 random trials) versus the number of measurements required.

In each of these experiments we see agreement between the observed phase transitions and the
theoretical predictions (Propositions 3.13, 3.15, and 4.2) of the number of measurements required
for exact recovery. In particular note that the phase transition in Figure 4 for the number of
measurements required for recovering an orthogonal matrix is very close to the prediction
$n \approx \frac{3m^2 - m}{4} = 295$ of Proposition 3.13. We refer the reader to [23, 54, 45] for similar phase transition plots
for recovering sparse vectors, low-rank matrices, and signed vectors from random measurements
via convex optimization.

6 Conclusions and Future Directions


This  manuscript  has  illustrated  that  for  a  fixed  set  of  base  atoms,  the  atomic  norm  is  the  best  choice 
of  a  convex  regularizer  for  solving  ill-posed  inverse  problems  with  the  prescribed  priors.  With  this 
in  mind,  our  results  in  Section  3  and  Section  4  outline  methods  for  computing  hard  limits  on  the 
number  of  measurements  required  for  recovery  from  any  convex  heuristic.  Using  the  calculus  of 
Gaussian  widths,  such  bounds  can  be  computed  in  a  relatively  straightforward  fashion,  especially 
if  one  can  appeal  to  notions  of  convex  duality  and  symmetry.  This  computational  machinery  of 
widths  and  dimension  counting  is  surprisingly  powerful:  near-optimal  bounds  on  estimating  sparse 
vectors  and  low-rank  matrices  from  partial  information  follow  from  elementary  integration.  Thus  we 
expect  that  our  new  bounds  concerning  symmetric,  vertex-transitive  polytopes  are  also  nearly  tight. 
Moreover,  algebraic  reasoning  allowed  us  to  explore  the  inherent  trade-offs  between  computational 
efficiency  and  measurement  demands.  More  complicated  algorithms  for  atomic  norm  regularization 
might  extract  structure  from  less  information,  but  approximation  algorithms  are  often  sufficient  for 
near  optimal  reconstructions. 

This  report  serves  as  a  foundation  for  many  new  exciting  directions  in  inverse  problems,  and 
we  close  our  discussion  with  a  description  of  several  natural  possibilities  for  future  work: 

Width  calculations  for  more  atomic  sets.  The  calculus  of  Gaussian  widths  described  in 
Section  3  provides  the  building  blocks  for  computing  the  Gaussian  widths  for  the  application 
examples  discussed  in  Section  2.  We  have  not  yet  exhaustively  estimated  the  widths  in  all  of  these 
examples,  and  a  thorough  cataloging  of  the  measurement  demands  associated  with  different  prior 
information  would  provide  a  more  complete  understanding  of  the  fundamental  limits  of  solving 
underdetermined  inverse  problems.  Moreover,  our  list  of  examples  is  by  no  means  exhaustive.  The 
framework  developed  in  this  paper  provides  a  compact  and  efficient  methodology  for  constructing 
regularizers  from  very  general  prior  information,  and  new  regularizers  can  be  easily  created  by 
translating  grounded  expert  knowledge  into  new  atomic  norms. 

Recovery bounds for structured measurements. Our recovery results focus on generic
measurements because, for a general set $\mathcal{A}$, it does not make sense to delve into specific measurement
ensembles. Particular structures of the measurement matrix $\Phi$ will depend on the application and
the atomic set $\mathcal{A}$. For instance, in compressed sensing, much work focuses on randomly sampled
Fourier coefficients [13] and random Toeplitz and circulant matrices [35, 53]. With low-rank
matrices, several authors have investigated reconstruction from a small collection of entries [14]. In all of
these cases, some notion of incoherence plays a crucial role, quantifying the amount of information
garnered from each row of $\Phi$. It would be interesting to explore how to appropriately generalize
notions of incoherence to new applications. Is there a particular definition that is general enough
to encompass most applications? Or do we need a specialized concept to match the specifics of
each atomic norm?

Quantifying the loss due to relaxation. Section 4.3 illustrates how the choice of approximation of a particular atomic norm can dramatically alter the number of measurements required for recovery. However, as was the case for vertices of the cut polytope, some relaxations incur only a very modest increase in measurement demands. Using techniques similar to those employed in the study of semidefinite relaxations of hard combinatorial problems, is it possible to provide a more systematic method to estimate the number of measurements required to recover points from polynomial-time computable norms?

Atomic norm decompositions. While the techniques of Section 3 and Section 4 provide bounds on the estimation of points in low-dimensional secant varieties of atomic sets, they do not provide a procedure for actually constructing decompositions. That is, we have provided bounds on the number of measurements required to recover points $x$ of the form

$$x = \sum_{a \in \mathcal{A}} c_a a$$

when the coefficient sequence $\{c_a\}$ is sparse, but we do not provide any methods for actually recovering $c$ itself. These decompositions are useful, for instance, in actually computing the rank-one binary vectors optimized in semidefinite relaxations of combinatorial algorithms [30, 47, 2], or in the computation of tensor decompositions from incomplete data [40]. Is it possible to use algebraic structure to generate deterministic or randomized algorithms for reconstructing the atoms that underlie a vector $x$, especially when approximate norms are used?

Large-scale  algorithms.  Finally,  we  think  that  the  most  fruitful  extensions  of  this  work  lie  in 
a  thorough  exploration  of  the  empirical  performance  and  efficacy  of  atomic  norms  on  large-scale 
inverse  problems.  The  proposed  algorithms  in  Section  5  require  only  the  knowledge  of  the  proximity 
operator  of  an  atomic  norm,  or  a  Euclidean  projection  operator  onto  the  dual  norm  ball.  Using 
these  design  principles  and  the  geometry  of  particular  atomic  norms  should  enable  the  scaling  of 
atomic  norm  techniques  to  massive  data  sets. 

A Proof of Proposition 3.6

Proof. First note that the Gaussian width can be upper-bounded as follows:

$$w(C \cap \mathbb{S}^{p-1}) \le \mathbb{E}_g\left[\sup_{z \in C \cap B(0,1)} g^T z\right], \tag{28}$$

where $B(0,1)$ denotes the unit Euclidean ball. The expression on the right hand side inside the expected value can be expressed as the optimal value of the following convex optimization problem for each $g \in \mathbb{R}^p$:

$$\begin{array}{ll} \max_z & g^T z \\ \text{s.t.} & z \in C \\ & \|z\|^2 \le 1 \end{array} \tag{29}$$

We now proceed to form the dual problem of (29) by first introducing the Lagrangian

$$\mathcal{L}(z, u, \gamma) = g^T z + \gamma(1 - z^T z) - u^T z$$

where $u \in C^*$ and $\gamma \ge 0$ is a scalar. To obtain the dual problem we maximize the Lagrangian with respect to $z$, which amounts to setting

$$z = \frac{1}{2\gamma}(g - u).$$

Plugging this into the Lagrangian above gives the dual problem

$$\begin{array}{ll} \min & \gamma + \frac{1}{4\gamma}\|g - u\|^2 \\ \text{s.t.} & u \in C^* \\ & \gamma \ge 0. \end{array}$$

Solving this optimization with respect to $\gamma$ we find that $\gamma = \frac{1}{2}\|g - u\|$, which gives the dual problem to (29)

$$\begin{array}{ll} \min & \|g - u\| \\ \text{s.t.} & u \in C^* \end{array} \tag{30}$$

Under very mild assumptions about $C$, the optimal value of (30) is equal to that of (29) (for example as long as $C$ has a non-empty relative interior, strong duality holds). Hence we have derived

$$\mathbb{E}_g\left[\sup_{z \in C \cap B(0,1)} g^T z\right] = \mathbb{E}_g\left[\operatorname{dist}(g, C^*)\right]. \tag{31}$$


This  equation  combined  with  the  bound  (28)  gives  us  the  desired  result. 


□ 
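
The duality between (29) and (30) can be checked by hand on a simple cone. The sketch below assumes $C$ is the nonnegative orthant, a choice made here purely for illustration and not an example taken from the proof. Its polar cone is the nonpositive orthant, so both optimal values have closed forms: the primal value is the norm of the positive part of $g$, and the dual value is the distance from $g$ to its entrywise clipping at zero. The function names are hypothetical.

```python
import math
import random


def sup_over_cone_cap_ball(g):
    """Optimal value of (29) when C is the nonnegative orthant.

    The maximizer is the normalized positive part of g, so the value
    is the Euclidean norm of that positive part.
    """
    return math.sqrt(sum(max(gi, 0.0) ** 2 for gi in g))


def dist_to_polar(g):
    """Optimal value of (30): distance from g to the nonpositive orthant."""
    # Projecting onto the nonpositive orthant clips positive entries to zero.
    projection = [min(gi, 0.0) for gi in g]
    return math.sqrt(sum((gi - ui) ** 2 for gi, ui in zip(g, projection)))


rng = random.Random(0)
g = [rng.gauss(0.0, 1.0) for _ in range(8)]
primal = sup_over_cone_cap_ball(g)
dual = dist_to_polar(g)
print(primal, dual)  # the two values agree
```

For this particular cone the agreement is exact for every $g$, which is the pointwise statement behind (31).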


B  Proof  of  Theorem  3.9 


Proof. We set $\beta = \frac{1}{\Theta}$. First note that if $\beta \ge \frac{1}{4}\exp\{\frac{p}{9}\}$ then the width bound exceeds $\sqrt{p}$, which is the maximal possible value for the width of $C$. Thus, we will assume throughout that $\beta \le \frac{1}{4}\exp\{\frac{p}{9}\}$.

Using Proposition 3.6 we need to upper bound the expected distance to the polar cone. Let $g \sim \mathcal{N}(0, I)$ be a normally distributed random vector. Then the norm of $g$ is independent from the angle of $g$. That is, $\|g\|$ is independent from $g/\|g\|$. Moreover, $g/\|g\|$ is distributed as a uniform sample on $\mathbb{S}^{p-1}$, and $\mathbb{E}_g[\|g\|] \le \sqrt{p}$. Thus we have

$$\mathbb{E}_g[\operatorname{dist}(g, C^*)] \le \mathbb{E}_g\left[\|g\| \cdot \operatorname{dist}(g/\|g\|, C^* \cap \mathbb{S}^{p-1})\right] \le \sqrt{p}\, \mathbb{E}_u\left[\operatorname{dist}(u, C^* \cap \mathbb{S}^{p-1})\right] \tag{32}$$

where $u$ is sampled uniformly on $\mathbb{S}^{p-1}$.
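
The step leading to (32) uses the fact that the expected norm of a standard Gaussian vector in $\mathbb{R}^p$ is at most $\sqrt{p}$. A minimal sketch of this fact, using the standard closed form for the mean of a chi distribution (the function name is hypothetical):

```python
import math


def expected_gaussian_norm(p):
    """E||g||_2 for g ~ N(0, I_p): sqrt(2) * Gamma((p + 1) / 2) / Gamma(p / 2)."""
    # Work with log-Gamma to avoid overflow for large p.
    log_ratio = math.lgamma((p + 1) / 2.0) - math.lgamma(p / 2.0)
    return math.sqrt(2.0) * math.exp(log_ratio)


for p in (1, 2, 9, 100, 10000):
    # The first number never exceeds the second, and the gap shrinks as p grows.
    print(p, expected_gaussian_norm(p), math.sqrt(p))
```

The ratio of the two printed values tends to one as $p$ grows, so little is lost in this step for large $p$.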

To bound the latter quantity, we will use isoperimetry. Suppose $A$ is a subset of $\mathbb{S}^{p-1}$ and $B$ is a spherical cap with the same volume as $A$. Let $N(A, r)$ denote the locus of all points in the sphere of Euclidean distance at most $r$ from the set $A$. Let $\mu$ denote the Haar measure on $\mathbb{S}^{p-1}$ and $\mu(A; r)$ denote the measure of $N(A, r)$. Then spherical isoperimetry states that $\mu(A; r) \ge \mu(B; r)$ for all $r \ge 0$ (see, for example [41, 46]).

Let $B$ now denote a spherical cap with $\mu(B) = \mu(C^* \cap \mathbb{S}^{p-1})$. Then we have

$$\begin{aligned} \mathbb{E}_u\left[\operatorname{dist}(u, C^* \cap \mathbb{S}^{p-1})\right] &= \int_0^\infty \mathbb{P}\left[\operatorname{dist}(u, C^* \cap \mathbb{S}^{p-1}) > t\right] dt && (33) \\ &= \int_0^\infty \left(1 - \mu(C^* \cap \mathbb{S}^{p-1}; t)\right) dt && (34) \\ &\le \int_0^\infty \left(1 - \mu(B; t)\right) dt && (35) \end{aligned}$$

where the first equality is the integral form of the expected value and the last inequality follows by isoperimetry. Hence we can bound the expected distance to the polar cone intersecting the sphere using only knowledge of the volume of spherical caps on $\mathbb{S}^{p-1}$.
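
To proceed let $v(\varphi)$ denote the volume of a spherical cap subtending a solid angle $\varphi$. An explicit formula for $v(\varphi)$ is

$$v(\varphi) = z_p^{-1} \int_0^{\varphi} \sin^{p-1}(\vartheta)\, d\vartheta \tag{36}$$

where $z_p = \int_0^{\pi} \sin^{p-1}(\vartheta)\, d\vartheta$ [38]. Let $\varphi(\beta)$ denote the minimal solid angle of a cap such that $\beta$ copies of that cap cover $\mathbb{S}^{p-1}$. Since the geodesic distance on the sphere is always greater than or equal to Euclidean distance, if $K$ is a spherical cap subtending $\varphi$ radians, $\mu(K; t) \ge v(\varphi + t)$. Therefore

$$\int_0^\infty (1 - \mu(B; t))\, dt \le \int_0^\infty (1 - v(\varphi(\beta) + t))\, dt. \tag{37}$$

We can proceed to simplify the right-hand-side integral:

$$\begin{aligned} \int_0^\infty (1 - v(\varphi(\beta) + t))\, dt &= \int_0^{\pi - \varphi(\beta)} (1 - v(\varphi(\beta) + t))\, dt && (38) \\ &= \pi - \varphi(\beta) - \int_0^{\pi - \varphi(\beta)} v(\varphi(\beta) + t)\, dt && (39) \\ &= \pi - \varphi(\beta) - z_p^{-1} \int_0^{\pi - \varphi(\beta)} \int_0^{\varphi(\beta) + t} \sin^{p-1}\vartheta \, d\vartheta\, dt && (40) \\ &= \pi - \varphi(\beta) - z_p^{-1} \int_0^{\pi} \int_{\max(\vartheta - \varphi(\beta), 0)}^{\pi - \varphi(\beta)} \sin^{p-1}\vartheta \, dt\, d\vartheta && (41) \\ &= \pi - \varphi(\beta) - z_p^{-1} \int_0^{\pi} \left\{\pi - \varphi(\beta) - \max(\vartheta - \varphi(\beta), 0)\right\} \sin^{p-1}\vartheta \, d\vartheta && (42) \\ &= z_p^{-1} \int_0^{\pi} \max(\vartheta - \varphi(\beta), 0) \sin^{p-1}\vartheta \, d\vartheta && (43) \\ &= z_p^{-1} \int_{\varphi(\beta)}^{\pi} (\vartheta - \varphi(\beta)) \sin^{p-1}\vartheta \, d\vartheta && (44) \end{aligned}$$

(41) follows by switching the order of integration and the rest of these equalities follow by straightforward integration and some algebra.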
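
The chain from (38) to (44) can be sanity-checked numerically. The sketch below evaluates the integral in (38) and the integral in (44) with a composite midpoint rule for a fixed angle, written `phi0` here as a stand-in for $\varphi(\beta)$. The helper names and the quadrature resolution are assumptions made for illustration.

```python
import math


def midpoint(f, a, b, n=400):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


def cap_identity(p, phi0):
    """Return (lhs, rhs): the integral in (38) and the integral in (44)."""

    def density(theta):
        return math.sin(theta) ** (p - 1)

    z_p = midpoint(density, 0.0, math.pi)

    def v(phi):
        # Normalized cap volume, as in (36).
        return midpoint(density, 0.0, phi) / z_p

    lhs = midpoint(lambda t: 1.0 - v(phi0 + t), 0.0, math.pi - phi0)
    rhs = midpoint(lambda th: (th - phi0) * density(th), phi0, math.pi) / z_p
    return lhs, rhs


lhs, rhs = cap_identity(p=9, phi0=1.0)
print(lhs, rhs)  # agree up to quadrature error
```

For $p = 2$ both sides reduce to $(\pi - \varphi - \sin\varphi)/2$, which gives an independent closed-form check of the same identity.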

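Using the inequalities that $z_p \ge \frac{2}{\sqrt{p-1}}$ (see [41]) and $\sin(x) \le \exp(-(x - \pi/2)^2/2)$ for $x \in [0, \pi]$, we can bound the last integral as

$$z_p^{-1} \int_{\varphi(\beta)}^{\pi} (\vartheta - \varphi(\beta)) \sin^{p-1}\vartheta \, d\vartheta \le \frac{\sqrt{p-1}}{2} \int_{\varphi(\beta)}^{\pi} (\vartheta - \varphi(\beta)) \exp\left(-\frac{p-1}{2}\left(\vartheta - \frac{\pi}{2}\right)^2\right) d\vartheta. \tag{45}$$

Performing the change of variables $a = \sqrt{p-1}\,(\vartheta - \frac{\pi}{2})$, we are left with the integral

$$\begin{aligned} & \frac{1}{2} \int_{\sqrt{p-1}(\varphi(\beta) - \pi/2)}^{\sqrt{p-1}\,\pi/2} \left\{ \frac{a}{\sqrt{p-1}} + \left(\frac{\pi}{2} - \varphi(\beta)\right) \right\} \exp\left(-\frac{a^2}{2}\right) da && (46) \\ &\quad = -\frac{1}{2\sqrt{p-1}} \exp\left(-\frac{a^2}{2}\right) \Bigg|_{\sqrt{p-1}(\varphi(\beta) - \pi/2)}^{\sqrt{p-1}\,\pi/2} + \frac{\frac{\pi}{2} - \varphi(\beta)}{2} \int_{\sqrt{p-1}(\varphi(\beta) - \pi/2)}^{\sqrt{p-1}\,\pi/2} \exp\left(-\frac{a^2}{2}\right) da && (47) \\ &\quad \le \frac{1}{2\sqrt{p-1}} \exp\left(-\frac{p-1}{2}\left(\frac{\pi}{2} - \varphi(\beta)\right)^2\right) + \sqrt{\frac{\pi}{2}} \left(\frac{\pi}{2} - \varphi(\beta)\right). && (48) \end{aligned}$$

In this final bound, we bounded the first term by dropping the upper integrand, and for the second term we used the fact that

$$\int_{-\infty}^{\infty} \exp(-x^2/2)\, dx = \sqrt{2\pi}. \tag{49}$$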
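
The two elementary inequalities behind (45) are easy to probe numerically. The sketch below assumes the lower bound on $z_p$ has the form $z_p \ge 2/\sqrt{p-1}$, and uses the standard Gamma-function closed form for $\int_0^\pi \sin^{p-1}$. Function names are hypothetical.

```python
import math


def z_p(p):
    """Integral of sin^(p-1) over [0, pi]: sqrt(pi) * Gamma(p/2) / Gamma((p+1)/2)."""
    log_ratio = math.lgamma(p / 2.0) - math.lgamma((p + 1) / 2.0)
    return math.sqrt(math.pi) * math.exp(log_ratio)


def sine_gap(x):
    """exp(-(x - pi/2)^2 / 2) - sin(x), which should be nonnegative on [0, pi]."""
    return math.exp(-((x - math.pi / 2.0) ** 2) / 2.0) - math.sin(x)


grid = [k * math.pi / 1000 for k in range(1001)]
min_gap = min(sine_gap(x) for x in grid)

# Ratio z_p / (2 / sqrt(p - 1)); a value of at least one confirms the assumed bound.
ratios = {p: z_p(p) * math.sqrt(p - 1) / 2.0 for p in (2, 9, 100, 1000)}
print(min_gap, ratios)
```

The ratio equals one at $p = 2$ and approaches $\sqrt{\pi/2}$ for large $p$, so the constant lost in this step is modest.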

We are now left with the task of computing a lower bound for $\varphi(\beta)$. We need to first reparameterize the problem. Let $K$ be a spherical cap. Without loss of generality, we may assume that

$$K = \{x \in \mathbb{S}^{p-1} : x_1 \ge h\} \tag{50}$$

for some $h \in [0, 1]$. Here $h$ is the height of the cap over the equator. Via elementary trigonometry, the solid angle that $K$ subtends is given by $\pi/2 - \sin^{-1}(h)$. Hence, if $h(\beta)$ is the largest number such that $\beta$ caps of height $h$ cover $\mathbb{S}^{p-1}$, then $h(\beta) = \sin(\pi/2 - \varphi(\beta))$.

The quantity $h(\beta)$ may be estimated using the following estimate from [9]. For $h \in [0, 1]$, let $\gamma(p, h)$ denote the volume of a spherical cap of $\mathbb{S}^{p-1}$ of height $h$.

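Lemma B.1 ([9]). For $1 \ge h \ge \frac{2}{\sqrt{p}}$,

$$\frac{1}{10 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}} \le \gamma(p, h) \le \frac{1}{2 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}}. \tag{51}$$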


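Note that for $h \ge \frac{2}{\sqrt{p}}$,

$$\frac{1}{2 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}} \le \frac{1}{4} (1 - h^2)^{\frac{p-1}{2}} \le \frac{1}{4} \exp\left(-\frac{p-1}{2} h^2\right). \tag{52}$$

So if

$$h = \sqrt{\frac{2 \log(4\beta)}{p-1}} \tag{53}$$

then $h \le 1$ because we have assumed $\beta \le \frac{1}{4}\exp\{\frac{p}{9}\}$. Moreover, $h \ge \frac{2}{\sqrt{p}}$ and the volume of the cap with height $h$ is less than or equal to $1/\beta$. That is

$$\varphi(\beta) \ge \pi/2 - \sin^{-1}\left(\sqrt{\frac{2 \log(4\beta)}{p-1}}\right). \tag{54}$$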

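Combining the estimate (48) with Proposition 3.6, and using our estimate for $\varphi(\beta)$, we get the bound

$$w(C) \le \frac{1}{2} \sqrt{\frac{p}{p-1}} \exp\left(-\frac{p-1}{2} \sin^{-1}\left(\sqrt{\frac{2 \log(4\beta)}{p-1}}\right)^2\right) + \sqrt{\frac{\pi p}{2}}\, \sin^{-1}\left(\sqrt{\frac{2 \log(4\beta)}{p-1}}\right). \tag{55}$$

This expression can be simplified by using the following bounds. First, $\sin^{-1}(x) \ge x$ lets us upper bound the first term by $\sqrt{\frac{p}{p-1}} \frac{1}{8\beta}$. For the second term, using the inequality $\sin^{-1}(x) \le \frac{\pi}{2} x$ results in the upper bound

$$w(C) \le \sqrt{\frac{p}{p-1}} \left( \frac{1}{8\beta} + \frac{\pi^{3/2}}{2} \sqrt{\log(4\beta)} \right). \tag{56}$$

For $p \ge 9$ the upper bound can be expressed simply as $w(C) \le 3\sqrt{\log(4\beta)}$. We recall that $\beta = \frac{1}{\Theta}$, which completes the proof of the theorem. □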
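
The final simplification can be illustrated numerically. The sketch below is written under the assumption that the bounds (55) and (56) take the forms coded in `bound_55` and `bound_56`. These forms follow from (48), (54), and the two arcsine inequalities quoted in the proof, but the exact constants are a reconstruction, so the code should be read as an illustration rather than a transcription. All function names are hypothetical.

```python
import math


def bound_55(p, beta):
    """Assumed form of (55); requires 2*log(4*beta) <= p - 1 so that arcsin is defined."""
    x = math.sqrt(2.0 * math.log(4.0 * beta) / (p - 1))
    a = math.asin(x)
    first = 0.5 * math.sqrt(p / (p - 1.0)) * math.exp(-(p - 1) / 2.0 * a * a)
    second = math.sqrt(math.pi * p / 2.0) * a
    return first + second


def bound_56(p, beta):
    """Assumed form of (56), obtained from x <= arcsin(x) <= (pi/2) x on [0, 1]."""
    scale = math.sqrt(p / (p - 1.0))
    return scale * (1.0 / (8.0 * beta)
                    + math.pi ** 1.5 / 2.0 * math.sqrt(math.log(4.0 * beta)))


def bound_final(beta):
    """The simplified bound 3 * sqrt(log(4 * beta))."""
    return 3.0 * math.sqrt(math.log(4.0 * beta))


for p, beta in [(50, 2.0), (100, 10.0), (1000, 50.0)]:
    print(p, beta, bound_55(p, beta), bound_56(p, beta), bound_final(beta))
```

In each printed row the three values are nondecreasing from left to right, matching the order in which the bounds are derived.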

C Direct Width Calculations


We  first  give  the  proof  of  Proposition  3.10. 


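Proof. Let $x^*$ be an $s$-sparse vector in $\mathbb{R}^p$ with $\ell_1$ norm equal to 1, and let $\mathcal{A}$ denote the set of unit-Euclidean-norm one-sparse vectors. Let $\Delta$ denote the set of coordinates where $x^*$ is non-zero. The normal cone at $x^*$ with respect to the $\ell_1$ ball is given by

$$N_{\mathcal{A}}(x^*) = \operatorname{cone}\left\{z \in \mathbb{R}^p : z_i = \operatorname{sgn}(x_i^*) \text{ for } i \in \Delta,\ |z_i| \le 1 \text{ for } i \in \Delta^c\right\}. \tag{57}$$

Here $\Delta^c$ represents the zero entries of $x^*$. Using this definition the minimum squared distance to the normal cone can be formulated as a one-dimensional convex optimization problem for arbitrary $x \in \mathbb{R}^p$

$$\begin{aligned} \inf_{z \in N_{\mathcal{A}}(x^*)} \|x - z\|_2^2 &= \inf_{\substack{t \ge 0 \\ |z_j| \le t,\ j \in \Delta^c}} \sum_{i \in \Delta} (x_i - t \operatorname{sgn}(x_i^*))^2 + \sum_{j \in \Delta^c} (x_j - z_j)^2 && (58) \\ &= \inf_{t \ge 0} \sum_{i \in \Delta} (x_i - t \operatorname{sgn}(x_i^*))^2 + \sum_{j \in \Delta^c} \operatorname{shrink}(x_j, t)^2 && (59) \end{aligned}$$

where

$$\operatorname{shrink}(x, t) = \begin{cases} x + t & x < -t \\ 0 & -t \le x \le t \\ x - t & x > t \end{cases} \tag{60}$$

is the $\ell_1$-shrinkage function. Hence for any fixed $t \ge 0$ independent of $g$, we have

$$\begin{aligned} \mathbb{E}_g\left[\inf_{z \in N_{\mathcal{A}}(x^*)} \|g - z\|_2^2\right] &\le \mathbb{E}_g\left[\sum_{i \in \Delta} (g_i - t \operatorname{sgn}(x_i^*))^2 + \sum_{j \in \Delta^c} \operatorname{shrink}(g_j, t)^2\right] && (61) \\ &= s(1 + t^2) + \mathbb{E}_g\left[\sum_{j \in \Delta^c} \operatorname{shrink}(g_j, t)^2\right]. && (62) \end{aligned}$$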
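
The one-dimensional problem (59) is straightforward to solve numerically because its objective is convex in $t$. The following is a minimal sketch, assuming a ternary search over $t$ (a choice made here for simplicity, not a method prescribed by the proof) and hypothetical function names.

```python
import math


def shrink(x, t):
    """Scalar soft-thresholding, as in (60)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def objective(x, x_star, t):
    """Objective of (59): support terms plus shrunk off-support terms."""
    total = 0.0
    for xi, si in zip(x, x_star):
        if si != 0.0:
            total += (xi - t * math.copysign(1.0, si)) ** 2
        else:
            total += shrink(xi, t) ** 2
    return total


def dist2_to_normal_cone(x, x_star, iters=200):
    """Squared distance from x to the normal cone of the l1 ball at x_star."""
    # The minimizer never exceeds max|x_i|, so this interval brackets it.
    lo, hi = 0.0, max(abs(xi) for xi in x) + 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(x, x_star, m1) <= objective(x, x_star, m2):
            hi = m2
        else:
            lo = m1
    t = 0.5 * (lo + hi)
    return objective(x, x_star, t), t


x_star = [1.0, 0.0, 0.0]
inside, t_inside = dist2_to_normal_cone([2.0, 1.0, -1.5], x_star)
outside, t_outside = dist2_to_normal_cone([-1.0, 0.5, 0.0], x_star)
print(inside, t_inside, outside, t_outside)
```

The first point lies in the normal cone (take $t = 2$), so its distance is zero. The second point is best approximated by the apex of the cone, giving squared distance $1.25$ at $t = 0$.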


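Now we directly integrate the second term, treating each term individually. For a zero-mean, unit variance normal random variable $g$,

$$\begin{aligned} \mathbb{E}_g\left[\operatorname{shrink}(g, t)^2\right] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-t} (g + t)^2 \exp(-g^2/2)\, dg + \frac{1}{\sqrt{2\pi}} \int_{t}^{\infty} (g - t)^2 \exp(-g^2/2)\, dg && (63) \\ &= \frac{2}{\sqrt{2\pi}} \int_{t}^{\infty} (g^2 - 2tg + t^2) \exp(-g^2/2)\, dg && (64) \\ &= -\frac{2}{\sqrt{2\pi}} (g - 2t) \exp(-g^2/2) \Big|_{t}^{\infty} + \frac{2(1 + t^2)}{\sqrt{2\pi}} \int_{t}^{\infty} \exp(-g^2/2)\, dg && (65) \\ &= -\frac{2}{\sqrt{2\pi}}\, t \exp(-t^2/2) + \frac{2(1 + t^2)}{\sqrt{2\pi}} \int_{t}^{\infty} \exp(-g^2/2)\, dg && (66) \\ &\le \frac{2}{\sqrt{2\pi}} \left( -t + (1 + t^2) \sqrt{\frac{\pi}{2}} \right) \exp(-t^2/2). && (67) \end{aligned}$$

The first simplification follows because the shrink function and Gaussian distributions are symmetric about the origin. The second equality follows by integrating by parts. The final inequality follows by a tight bound on the Gaussian Q-function

$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} \exp(-g^2/2)\, dg \le \frac{1}{2} \exp(-x^2/2). \tag{68}$$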
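
The closed form in (66) and the bound that follows from it can be checked against simulation. The sketch below assumes the Q-function bound $Q(x) \le \frac{1}{2}\exp(-x^2/2)$ for $x \ge 0$, and compares the exact second moment of the shrunk Gaussian with a Monte Carlo estimate and with the resulting upper bound. The function names, sample size, and seed are arbitrary choices made for illustration.

```python
import math
import random


def shrink(x, t):
    """Scalar soft-thresholding."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def q_function(x):
    """Gaussian tail probability Q(x), computed from the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def expected_shrink_sq(t):
    """Closed form (66): 2 (1 + t^2) Q(t) - (2 / sqrt(2 pi)) t exp(-t^2 / 2)."""
    density = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    return 2.0 * (1.0 + t * t) * q_function(t) - 2.0 * t * density


def upper_bound(t):
    """Bound obtained by replacing Q(t) with exp(-t^2 / 2) / 2 in the closed form."""
    return ((1.0 + t * t) - 2.0 * t / math.sqrt(2.0 * math.pi)) * math.exp(-t * t / 2.0)


def monte_carlo(t, n=200000, seed=0):
    """Sample average of shrink(g, t)^2 over n standard Gaussian draws."""
    rng = random.Random(seed)
    return sum(shrink(rng.gauss(0.0, 1.0), t) ** 2 for _ in range(n)) / n


t = 1.0
print(expected_shrink_sq(t), monte_carlo(t), upper_bound(t))
```

At $t = 0$ the closed form and the bound both equal one, which is the variance of $g$ itself, so the bound is tight at that endpoint.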

Using this bound we get

$$\mathbb{E}_g\left[\inf_{z \in N_{\mathcal{A}}(x^*)} \|g - z\|_2^2\right] \le s(1 + t^2) + (p - s) \left( (1 + t^2) - \frac{2}{\sqrt{2\pi}}\, t \right) \exp(-t^2/2). \tag{69}$$

Setting $t = \sqrt{2 \log(p - s) - 1}$ gives

$$\mathbb{E}_g\left[\inf_{z \in N_{\mathcal{A}}(x^*)} \|g - z\|_2^2\right] \le (2s + 1) \log(p - s) \tag{70}$$

provided that $p > s + 2$. □


Next  we  give  the  proof  of  Proposition  3.11. 

Proof. Let $x^*$ be an $m_1 \times m_2$ matrix of rank $r$ with singular value decomposition $U \Sigma V^T$, and let $\mathcal{A}$ denote the set of rank-one unit-Euclidean-norm matrices of size $m_1 \times m_2$. Without loss of generality, impose the conventions $m_1 \le m_2$, $\Sigma$ is $r \times r$, $U$ is $m_1 \times r$, $V$ is $m_2 \times r$, and assume the nuclear norm of $x^*$ is equal to 1.

Let $u_k$ (respectively $v_k$) denote the $k$'th column of $U$ (respectively $V$). It is convenient to introduce the orthogonal decomposition $\mathbb{R}^{m_1 \times m_2} = \Delta \oplus \Delta^\perp$ where $\Delta$ is the linear space spanned by elements of the form $u_k z^T$ and $y v_k^T$, $1 \le k \le r$, where $z$ and $y$ are arbitrary, and $\Delta^\perp$ is the orthogonal complement of $\Delta$. The space $\Delta^\perp$ is the subspace of matrices spanned by the family $(y z^T)$, where $y$ (respectively $z$) is any vector orthogonal to $U$ (respectively $V$). The normal cone of the nuclear norm ball at $x^*$ is given by the cone generated by the subdifferential at $x^*$:

$$N_{\mathcal{A}}(x^*) = \operatorname{cone}\left\{UV^T + W \in \mathbb{R}^{m_1 \times m_2} : W^T U = 0,\ WV = 0,\ \|W\|_{\mathcal{A}}^* \le 1\right\}. \tag{71}$$


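Using this definition, the minimum squared distance to the normal cone can be formulated as a one-dimensional convex optimization problem for arbitrary $X \in \mathbb{R}^{m_1 \times m_2}$

$$\begin{aligned} \inf_{Z \in N_{\mathcal{A}}(x^*)} \|X - Z\|_F^2 &= \inf_{\substack{t \ge 0 \\ \|W\|_{\mathcal{A}}^* \le t,\ W \in \Delta^\perp}} \|\mathcal{P}_{\Delta}(X) - t UV^T\|_F^2 + \|\mathcal{P}_{\Delta^\perp}(X) - W\|_F^2 && (72) \\ &= \inf_{t \ge 0} \|\mathcal{P}_{\Delta}(X) - t UV^T\|_F^2 + \|\operatorname{shrink}_\sigma(\mathcal{P}_{\Delta^\perp}(X), t)\|_F^2 && (73) \end{aligned}$$

where $\operatorname{shrink}_\sigma$ is the matrix shrinkage function. Recall that if $X$ admits a singular value decomposition $A S B^T$ and $S$ has the vector $s$ of singular values on its main diagonal, then

$$\operatorname{shrink}_\sigma(X, t) = A \operatorname{diag}(\operatorname{shrink}(s, t)) B^T \tag{74}$$

where shrink is the ordinary $\ell_1$ shrinkage function defined in (60) applied to each element of the vector $s$.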

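Let $G$ be a Gaussian random matrix with i.i.d. entries, each with mean zero and unit variance. Then for any fixed $t \ge 0$ independent of $G$, we have

$$\begin{aligned} \mathbb{E}_G\left[\inf_{Z \in N_{\mathcal{A}}(x^*)} \|G - Z\|_F^2\right] &\le \mathbb{E}_G\left[\|\mathcal{P}_{\Delta}(G) - t UV^T\|_F^2 + \|\operatorname{shrink}_\sigma(\mathcal{P}_{\Delta^\perp}(G), t)\|_F^2\right] && (75) \\ &= r(m_1 + m_2 - r) + r t^2 + \mathbb{E}_G\left[\|\operatorname{shrink}_\sigma(\mathcal{P}_{\Delta^\perp}(G), t)\|_F^2\right]. && (76) \end{aligned}$$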
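
To bound the latter expectation, we again use the integral form of the expected value

$$\begin{aligned} \mathbb{E}_G\left[\|\operatorname{shrink}_\sigma(\mathcal{P}_{\Delta^\perp}(G), t)\|_F^2\right] &= \int_0^\infty \mathbb{P}\left[\|\operatorname{shrink}_\sigma(\mathcal{P}_{\Delta^\perp}(G), t)\|_F^2 > h\right] dh && (77) \\ &\le \int_0^\infty \mathbb{P}\left[(m_1 - r) \|\operatorname{shrink}_\sigma(\mathcal{P}_{\Delta^\perp}(G), t)\|^2 > h\right] dh && (78) \\ &\le \int_0^\infty \mathbb{P}\left[\|\mathcal{P}_{\Delta^\perp}(G)\| > t + \sqrt{\frac{h}{m_1 - r}}\right] dh. && (79) \end{aligned}$$

Here $\|\cdot\|$ without a subscript denotes the operator norm. Step (78) uses the fact that $\mathcal{P}_{\Delta^\perp}(G)$ has rank at most $m_1 - r$, so its squared Frobenius norm is at most $m_1 - r$ times its squared operator norm.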

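Now $\mathcal{P}_{\Delta^\perp}(G)$ is identically distributed as an $(m_1 - r) \times (m_2 - r)$ matrix with i.i.d. Gaussian entries (due to the isotropy of random Gaussians), each with mean zero and variance one. We thus know that

$$\mathbb{P}\left[\|\mathcal{P}_{\Delta^\perp}(G)\| \ge \sqrt{m_1 - r} + \sqrt{m_2 - r} + s\right] \le \exp(-s^2/2) \tag{80}$$

(see, for example, [17]). Setting $t = \sqrt{m_1 - r} + \sqrt{m_2 - r}$ gives the upper bound

$$\mathbb{E}_G\left[\|\operatorname{shrink}_\sigma(\mathcal{P}_{\Delta^\perp}(G), t)\|_F^2\right] \le \int_0^\infty \exp\left(-\frac{h}{2(m_1 - r)}\right) dh = 2(m_1 - r). \tag{81}$$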

Thus, with the same value of $t$ in (76), we get that

$$\begin{aligned} \mathbb{E}_G\left[\inf_{Z \in N_{\mathcal{A}}(x^*)} \|G - Z\|_F^2\right] &\le r(m_1 + m_2 - r) + r\left(\sqrt{m_1 - r} + \sqrt{m_2 - r}\right)^2 + 2(m_1 - r) && (82) \\ &\le r(m_1 + m_2 - r) + 2r(m_1 + m_2 - 2r) + 2(m_1 - r) && (83) \\ &= 3r(m_1 + m_2 - r) + 2(m_1 - r - r^2) && (84) \end{aligned}$$

where the second inequality follows from the fact that $(a + b)^2 \le 2a^2 + 2b^2$.

□
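
The algebra linking (82), (83), and (84) can be verified mechanically. The sketch below evaluates the three expressions for given dimensions and rank; the function names are hypothetical.

```python
import math


def bound_82(m1, m2, r):
    """Right-hand side of (82) with t = sqrt(m1 - r) + sqrt(m2 - r)."""
    t = math.sqrt(m1 - r) + math.sqrt(m2 - r)
    return r * (m1 + m2 - r) + r * t * t + 2 * (m1 - r)


def bound_83(m1, m2, r):
    """Right-hand side of (83), after applying (a + b)^2 <= 2 a^2 + 2 b^2."""
    return r * (m1 + m2 - r) + 2 * r * (m1 + m2 - 2 * r) + 2 * (m1 - r)


def bound_84(m1, m2, r):
    """Right-hand side of (84), the collected form."""
    return 3 * r * (m1 + m2 - r) + 2 * (m1 - r - r * r)


m1, m2, r = 50, 80, 5
print(bound_82(m1, m2, r), bound_83(m1, m2, r), bound_84(m1, m2, r))
```

For a rank-5 matrix of size $50 \times 80$ the final expression evaluates to 1915, compared with an ambient dimension of 4000.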